Mailing List Archive

[PATCH] enable port accesses with (almost) full register context
This helped HP get certain system management software going (in
dom0) that triggers SMIs and depends on more than just the port number
and data register values being visible to the SMI handler.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Re: [PATCH] enable port accesses with (almost) full register context
On 11/9/06 5:12 pm, "Jan Beulich" <jbeulich@novell.com> wrote:

> This helped HP get certain system management software going (in
> dom0) that triggers SMIs and depends on more than just the port number
> and data register values being visible to the SMI handler.

That's quite rough. The 'special' handlers do more than just register
restore/save: what's all the locking and other assorted bits and pieces
doing in there? The 'special/normal' distinction at the interface is (I
suppose to some extent unavoidably) ugly and non-obvious.

Would it be cleaner to allow dom0 to have really direct access to some I/O
ports by allowing it to set a real I/O bitmap? I implemented I/O bitmaps via
emulation mainly because it makes context switching faster and it is less of
a pain to keep admin and guest bitmasks in sync if they are checked
synchronously. But a direct dom0-only bitmap would be a bit easier: quick to
turn on/off and no need to sync with admin bitmaps. Main downside is that
it'll slow down context-switch paths a little bit.
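
As a rough illustration of the emulated check described above, the sketch below tests an access against two bitmaps in one step. It assumes both bitmaps use the hardware TSS convention (a set bit denies the port, and every byte covered by the access must be permitted). The type and function names are hypothetical and are not taken from the Xen source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_PORTS 65536u

/* One bit per port; a set bit means "access denied" (TSS convention). */
typedef struct { uint8_t bits[NR_PORTS / 8]; } io_bitmap;

/* Start from a bitmap that denies every port. */
static void io_bitmap_deny_all(io_bitmap *bm)
{
    assert(bm != NULL);
    memset(bm->bits, 0xff, sizeof bm->bits);
}

/* Clear the deny bit for 'count' ports starting at 'first'. */
static void io_bitmap_allow(io_bitmap *bm, unsigned first, unsigned count)
{
    assert(bm != NULL);
    for (unsigned p = first; p < first + count && p < NR_PORTS; p++)
        bm->bits[p / 8] &= (uint8_t)~(1u << (p % 8));
}

/* An access of 'bytes' bytes is allowed only if no covered port is denied. */
static bool io_bitmap_permits(const io_bitmap *bm, unsigned port, unsigned bytes)
{
    assert(bm != NULL);
    if (bytes == 0 || port >= NR_PORTS || bytes > NR_PORTS - port)
        return false;
    for (unsigned p = port; p < port + bytes; p++)
        if (bm->bits[p / 8] & (1u << (p % 8)))
            return false;
    return true;
}

/* Emulation checks the admin and guest bitmaps together at access time,
 * so the two never need to be merged or kept in sync. */
static bool emulated_access_ok(const io_bitmap *admin, const io_bitmap *guest,
                               unsigned port, unsigned bytes)
{
    return io_bitmap_permits(admin, port, bytes) &&
           io_bitmap_permits(guest, port, bytes);
}
```

With a real hardware bitmap, by contrast, only one bitmap is consulted by the CPU, so the two sets of permissions would have to be combined ahead of time and reinstalled on context switch.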

-- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [PATCH] enable port accesses with (almost) full register context
>>> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> 11.09.06 18:19 >>>
>On 11/9/06 5:12 pm, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> This helped HP get certain system management software going (in
>> dom0) that triggers SMIs and depends on more than just the port number
>> and data register values being visible to the SMI handler.
>
>That's quite rough. The 'special' handlers do more than just register
>restore/save: what's all the locking and other assorted bits and pieces
>doing in there? The 'special/normal' distinction at the interface is (I
>suppose to some extent unavoidably) ugly and non-obvious.

That is because of the self modifying code that needs proper MP
synchronization. I know it's looking ugly, but I considered this the most
reasonable approach.
I'm not sure I understand what ugliness you find in the special/normal
distinction logic; one thing I'm thinking of is the additional meaning
added to the hypercall interface - I simply didn't want to introduce a
new sub-function there, especially since the existing one provided
ample room for the needed addition. But certainly, if you want that
changed, it should be easily doable (even without significantly affecting
HP's code already utilizing the interface as we added it to our 3.0.2).

>Would it be cleaner to allow dom0 to have really direct access to some I/O
>ports by allowing it to set a real I/O bitmap? I implemented I/O bitmaps via
>emulation mainly because it makes context switching faster and it is less of
>a pain to keep admin and guest bitmasks in sync if they are checked
>synchronously. But a direct dom0-only bitmap would be a bit easier: quick to
>turn on/off and no need to sync with admin bitmaps. Main downside is that
>it'll slow down context-switch paths a little bit.

I considered that too, but rejected it because of opening these ports to
vm86 mode then, too (as I/O instructions are *not* susceptible to iopl there,
they only depend on the bitmap).

Jan

Re: [PATCH] enable port accesses with (almost) full register context
On 12/9/06 8:15 am, "Jan Beulich" <jbeulich@novell.com> wrote:

> That is because of the self modifying code that needs proper MP
> synchronization. I know it's looking ugly, but I considered this the most
> reasonable approach.

Why is more synchronisation needed for emulation of these SMI port accesses
than you'd have for direct execution? I.e., if the accesses were executed
natively on an SMP system there'd be none of the extra synchronisation you
added happening. The instructions would be directly executed.

> I considered that too, but rejected it because of opening these ports to
> vm86 mode then, too (as I/O instructions are *not* susceptible to iopl there,
> they only depend on the bitmap).

I/O bitmap always overrides IOPL, in every execution mode. Why is vm86 mode
a particular concern? I was thinking that dom0 would switch on the direct
bitmap access only for the process(es) that requested it. We wouldn't want
direct access to be available to every process in dom0.

Not that I'm certain direct access is better than 'special emulation'. But
I'm not applying the existing patch unless I understand exactly why it needs
to do everything that it does. I'm in no rush -- supporting some piece of HP
closed-source management software isn't top priority for us, I'd say.

-- Keir



Re: [PATCH] enable port accesses with (almost) full register context
>>> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> 12.09.06 09:53 >>>
>On 12/9/06 8:15 am, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> That is because of the self modifying code that needs proper MP
>> synchronization. I know it's looking ugly, but I considered this the most
>> reasonable approach.
>
>Why is more synchronisation needed for emulation of these SMI port accesses
>than you'd have for direct execution? I.e., if the accesses were executed
>natively on an SMP system there'd be none of the extra synchronisation you
>added happening. The instructions would be directly executed.

Again, I'm using self-modifying code there (to store the port number, as I
can't reliably use %dx for it if the original instruction happened to be one
with immediate operand, and %edx/%rdx happens to carry relevant data
for the SMI handler), which is what needs synchronization.
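
To make the encoding constraint concrete, here is a minimal sketch of how an IN/OUT instruction could be encoded into a byte buffer while preserving the form of the original instruction. The function name and parameters are hypothetical. The opcodes are the standard x86 ones (0xE4..0xE7 for the imm8 forms, 0xEC..0xEF for the %dx forms), and the sketch assumes a 32-bit or 64-bit code segment, where 16-bit accesses need the 0x66 operand-size prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode an IN or OUT instruction into buf (at least 3 bytes).
 *   is_out   - nonzero for OUT, zero for IN
 *   bytes    - access width: 1, 2 or 4
 *   use_imm8 - nonzero to keep the immediate-port form, so that %dx is
 *              left free to carry whatever the guest had in it
 *   port     - port number, only used for the immediate form (0..255)
 * Returns the number of bytes written, or 0 for an invalid width.
 */
static size_t encode_port_insn(uint8_t *buf, int is_out, unsigned bytes,
                               int use_imm8, uint8_t port)
{
    size_t n = 0;
    uint8_t opcode;

    assert(buf != NULL);
    if (bytes != 1 && bytes != 2 && bytes != 4)
        return 0;

    if (bytes == 2)
        buf[n++] = 0x66;              /* operand-size override prefix */

    opcode = use_imm8 ? 0xe4 : 0xec;  /* IN AL,imm8 / IN AL,DX */
    if (is_out)
        opcode += 2;                  /* OUT imm8,AL / OUT DX,AL */
    if (bytes != 1)
        opcode += 1;                  /* eAX-sized variant */
    buf[n++] = opcode;

    if (use_imm8)
        buf[n++] = port;              /* the port lives in the code itself */
    return n;
}
```

Because the immediate form carries the port inside the instruction bytes, a single shared copy of the instruction has to be rewritten per access, which is where the need for synchronisation on SMP comes from.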

>> I considered that too, but rejected it because of opening these ports to
>> vm86 mode then, too (as I/O instructions are *not* susceptible to iopl there,
>> they only depend on the bitmap).
>
>I/O bitmap always overrides IOPL, in every execution mode. Why is vm86 mode
>a particular concern? I was thinking that dom0 would switch on the direct

You're right, of course - all modes are relevant here.

>bitmap access only for the process(es) that requested it. We wouldn't want
>direct access to be available to every process in dom0.

True. With that I agree installing the bitmap in the TSS would allow solving
the problem, too. Still I think the necessary overhead (you'd need to copy
the bitmap and keep it sync-ed, or make it read-only, for the direct access
to not be abusable) would be larger than using the special access method.

>Not that I'm certain direct access is better than 'special emulation'. But
>I'm not applying the existing patch unless I understand exactly why it needs
>to do everything that it does. I'm in no rush -- supporting some piece of HP
>closed-source management software isn't top priority for us, I'd say.

Which I can easily understand; nevertheless I seem to recall that we had
talked about the issue when it was first brought up (at least 3 months back),
and you seemed in agreement that the nature of the problem warrants a fix.

Jan

Re: [PATCH] enable port accesses with (almost) full register context
On 12/9/06 10:03, "Jan Beulich" <jbeulich@novell.com> wrote:

> Again, I'm using self-modifying code there (to store the port number, as I
> can't reliably use %dx for it if the original instruction happened to be one
> with immediate operand, and %edx/%rdx happens to carry relevant data
> for the SMI handler), which is what needs synchronization.

Ok, I see. I think it would be neater to build the code on the stack, or
some other per-cpu area, and avoid the synchronisation. We have no plans to
use the PAGE_NX flag in Xen itself, and x86/64 already has stack
trampolines. Perhaps the register save/restore code could be tidied too,
since it's not performance critical. It's not at all uniform like I'd
expect, with those interleaved push/pop/mov instructions. How about
something more like:
pushad; call restore_guest_regs; <I/O port access>; popad
Where restore_guest_regs takes a regparm, and (obviously) restores the
regparm register last. I'd only do it as a call because it'd be ugly to
dynamically build that amount of code.

I'm not sure about the full extent of the interface changes either. How
about we add a new sysctl for specifying ports which need 'direct
execution'. It makes sense to make it a sysctl because this is a property of
the I/O port (or assumptions about it encoded in the platform firmware)
rather than a per-domain issue, or something that I think should be visible
at the physdev_op interface.

We'd test the per-port direct-execution flag for any port access by any
domain. After all, the only reason we don't use the new code for *all* port
accesses is concern about performance. I think calling this 'direct
execution' versus 'emulation' at the interface is fair -- even though we
emulate in all cases, in the former case it will be Xen's responsibility to
do all that is necessary to make it appear to the BIOS that the instruction
was executed directly, as when running natively.

-- Keir



Re: [PATCH] enable port accesses with (almost) full register context
>>> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> 12.09.06 11:50 >>>
>On 12/9/06 10:03, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> Again, I'm using self-modifying code there (to store the port number, as I
>> can't reliably use %dx for it if the original instruction happened to be one
>> with immediate operand, and %edx/%rdx happens to carry relevant data
>> for the SMI handler), which is what needs synchronization.
>
>Ok, I see. I think it would be neater to build the code on the stack, or
>some other per-cpu area, and avoid the synchronisation. We have no plans to
>use the PAGE_NX flag in Xen itself, and x86/64 already has stack
>trampolines. Perhaps the register save/restore code could be tidied too,
>since it's not performance critical. It's not at all uniform like I'd
>expect, with those interleaved push/pop/mov instructions. How about
>something more like:
> pushad; call restore_guest_regs; <I/O port access>; popad
>Where restore_guest_regs takes a regparm, and (obviously) restores the
>regparm register last. I'd only do it as a call because it'd be ugly to
>dynamically build that amount of code.

Hm, I don't like this on-the-fly building of code very much, and I also don't
like writing assembly code that can obviously be written to perform better. Also,
on 64-bits the code wouldn't look so much nicer since there's no {push,pop}ad.
But certainly, if you refuse to take the patch without changing that...

>I'm not sure about the full extent of the interface changes either. How
>about we add a new sysctl for specifying ports which need 'direct
>execution'. It makes sense to make it a sysctl because this is a property of
>the I/O port (or assumptions about it encoded in the platform firmware)
>rather than a per-domain issue, or something that I think should be visible
>at the physdev_op interface.
>
>We'd test the per-port direct-execution flag for any port access by any
>domain. After all, the only reason we don't use the new code for *all* port
>accesses is concern about performance. I think calling this 'direct
>execution' versus 'emulation' at the interface is fair -- even though we
>emulate in all cases, in the former case it will be Xen's responsibility to
>do all that is necessary to make it appear to the BIOS that the instruction
>was executed directly, as when running natively.

That sounds right (and better than the current way). I'll do that change,
though I guess I'd still not call it direct execution.

Jan

Re: [PATCH] enable port accesses with (almost) full register context
On 12/9/06 11:32, "Jan Beulich" <jbeulich@novell.com> wrote:

> Hm, I don't like this on-the-fly building of code very much, and I also don't
> like writing assembly code that can obviously be written to perform better. Also,
> on 64-bits the code wouldn't look so much nicer since there's no {push,pop}ad.
> But certainly, if you refuse to take the patch without changing that...

IMO you're doing code building anyway, but just of one instruction. You get
rid of the locking by doing it to a per-CPU buffer, and the stack is the
obvious place, calling out to register save/restore code. I don't really
care about the performance of the save/restore code -- it's obviously going
to be trivial compared with the unavoidable trap-and-emulate cost. Also, do
you need separate save/restore code for IN vs. OUT instructions?

Something like:
call save_host_restore_guest
<IN or OUT>
call save_guest_restore_host
ret

Would that be reasonable?

> That sounds right (and better than the current way). I'll do that change,
> though I guess I'd still not call it direct execution.

'Special' is a crappy description because it's so non-specific. How about
'BIOS' ports? I can't think of any reason that emulating these accesses
could be a problem, except that BIOS/firmware is trapping them and expecting
more context than the hardware instruction defines as being required.

Alternatively, perhaps we could get rid of the distinction and emulate all
port accesses in this way? I suspect that the cost of state save/restore and
building the trampoline is dwarfed by the cost of the GPF and even the cost
of the I/O port access itself (they don't tend to be super fast). Could you
do a few quick measurements to determine this? If the extra cost is less
than, say, 10%, I'd be inclined to take the hit to avoid interface changes.

-- Keir



Re: [PATCH] enable port accesses with (almost) full register context
>IMO you're doing code building anyway, but just of one instruction. You get
>rid of the locking by doing it to a per-CPU buffer, and the stack is the
>obvious place, calling out to register save/restore code. I don't really
>care about the performance of the save/restore code -- it's obviously going
>to be trivial compared with the unavoidable trap-and-emulate cost. Also, do
>you need separate save/restore code for IN vs. OUT instructions?

Actually, in the code I currently have I do. This is because for out-s I need
to merge the value output with the user-specified rAX, under the
assumption that output value and register contents are not always identical
(i.e. if particular bits within a port need to be specially treated by Xen,
which I can easily imagine to be required at some point).
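
The merge described here can be sketched as follows: the low bytes of rAX are replaced by the value that is actually to be output, while the remaining bytes keep what the guest had in the register. This is only an illustration of the idea with a hypothetical function name, not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build the rAX value to load before executing OUT on the guest's behalf.
 *   guest_rax - the register contents the guest supplied
 *   value     - the (possibly adjusted) value that should reach the port
 *   bytes     - access width: 1, 2 or 4
 * Only the low 'bytes' bytes come from 'value'; everything above is kept
 * from guest_rax so an SMI handler still sees the guest's upper bits.
 */
static uint64_t merge_out_value(uint64_t guest_rax, uint32_t value,
                                unsigned bytes)
{
    uint64_t mask;

    assert(bytes == 1 || bytes == 2 || bytes == 4);
    mask = (bytes == 4) ? 0xffffffffu : ((1u << (8 * bytes)) - 1);
    return (guest_rax & ~mask) | (value & mask);
}
```

If the output value is always identical to the low part of the guest register, the merge is a no-op, which is the case in which the separate IN and OUT paths would not be needed.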

>Something like:
> call save_host_restore_guest
> <IN or OUT>
> call save_guest_restore_host
> ret
>
>Would that be reasonable?

It would, provided the above assumption about the need to modify the
output value would never become true. Additionally, for 64-bits, I'm
concerned about the potential need for using indirect calls here (as well
as in the syscall trampolines): there's nothing keeping a user from making
the Xen heap 2Gb or more in size. These would further slow things down,
but depending on the nature of allocations made from the Xen heap it
may also be possible to simply place an upper limit on the heap size, as
it currently is assumed adjacent to the Xen image (but taking memory
holes at rather low addresses into account a user may even be required
to bump the heap size significantly - what if only a few Mb of memory
below 4Gb existed? - since, after all, the heap size is the size of address
space consumed, not the amount of memory used).

>Alternatively, perhaps we could get rid of the distinction and emulate all
>port accesses in this way? I suspect that the cost of state save/restore and
>building the trampoline is dwarfed by the cost of the GPF and even the cost
>of the I/O port access itself (they don't tend to be super fast). Could you
>do a few quick measurements to determine this? If the extra cost is less
>than, say, 10%, I'd be inclined to take the hit to avoid interface changes.

Percentages of full-context relative to simply emulated I/O, without yet
having changed from the assembly-file approach to the stub-building one (as
per the above issues):

PentiumIII (32-bit)   with locking      67%
PentiumIII (32-bit)   without locking   84%
Pentium4 (64-bit)     with locking      86%
Pentium4 (64-bit)     without locking   89%

Revised patch (domctl->sysctl, naming) attached.

Jan
Re: [PATCH] enable port accesses with (almost) full register context
On 13/9/06 10:46, "Jan Beulich" <jbeulich@novell.com> wrote:

> It would, provided the above assumption about the need to modify the
> output value would never become true.

I hope it doesn't. :-) We'll cross this bridge if we come to it.

> Additionally, for 64-bits, I'm
> concerned about the potential need for using indirect calls here (as well
> as in the syscall trampolines): there's nothing keeping a user from making
> the Xen heap 2Gb or more in size.

Not much of a concern. Perhaps we should clamp the heap_size parameter to
2GB as a short-term fix for this issue. As you say, it can also affect the
syscall trampolines so users would soon notice if this was broken!

When we merge Xen and domain heaps on x86/64, we'll probably require Xen
allocations to come from a zone <= 2GB. Xen doesn't allocate much memory, so
that's not going to be a particularly serious constraint.

> Percentages of full-context relative to simply emulated I/O, without yet
> having changed from the assembly-file approach to the stub-building one (as
> per the above issues):
>
> PentiumIII (32-bit)   with locking      67%
> PentiumIII (32-bit)   without locking   84%
> Pentium4 (64-bit)     with locking      86%
> Pentium4 (64-bit)     without locking   89%

A little bit higher overhead than I'd hoped, but not terrible. Let's see how
it looks with the stub-building method, and then decide whether to bother
with the sysctl interface. Perhaps highly-optimised assembly code
save/restore routines will be required after all. :-)

Cheers,
Keir

> Revised patch (domctl->sysctl, naming) attached.



Re: [PATCH] enable port accesses with (almost) full register context
>>> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> 13.09.06 14:10 >>>
>On 13/9/06 10:46, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> It would, provided the above assumption about the need to modify the
>> output value would never become true.
>
>I hope it doesn't. :-) We'll cross this bridge if we come to it.

It'll be immediately needed if string I/O instructions are to also go that
path, unless you'd want them to access the original user buffer (and
trap the eventual page fault).

Also, I might need a little more clarification on the stack (ab)use for
creating stubs: As I understand it, the double-fault and NMI stacks on
x86-64 are currently simply overlaid on top of the normal stack,
basically assuming you'd never use this much space (the one-page
non-present separator is inserted only in debug builds). (Side note:
While for normal operations this is fine, I question the value of a
double fault backtrace that might be created due to a stack overflow
on a non-debug build. The obvious question is why the separator hole
isn't always being created - after all this is a one time operation that
happens as CPUs get brought up, so there shouldn't be any
performance overhead.)

Anyway, the relationship to the stubs is that I would favor moving the
stubs onto the double fault stack itself (rather than adjacent to the NMI
stack, which in turn is adjacent to the double fault one), because
(a) the stubs won't be needed anymore once the double fault stack is
needed and
(b) the stubs are this way farther away from the normal stack, making
it less likely for difficult-to-debug problems to creep in. I would then
similarly put the 32-bit I/O stubs onto the (top of the) (would-be
double fault) stack (which should be per CPU as much as on 64-bits,
but I realize that would imply per-CPU double fault TSSes and hence
per-CPU GDTs).

Jan

Re: [PATCH] enable port accesses with (almost) full register context
On 18/9/06 11:40, "Jan Beulich" <jbeulich@novell.com> wrote:

> It'll be immediately needed if string I/O instructions are to also go that
> path, unless you'd want them to access the original user buffer (and
> trap the eventual page fault).

We emulate INS/OUTS as a sequence of IN/OUT plus copy_to/from_guest. Unless
the SMM code depends on us not having 'clobbered' %eax (which we would need
to do to emulate OUTS with OUT) then we should be okay there. I guess how
complicated the save/restore code needs to be depends on just how accurately
we need to set up the register state for this HP SMM code -- for example, I
guess we get away with SS:ESP being incorrect; can we get away with EAX as
well? Hmm... I guess you have made your point that there are devils in the
detail of doing this emulation. ;-)
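
A minimal sketch of emulating a REP INS as a sequence of single IN operations, each followed by a copy into guest memory, might look like the following. Everything here is a stand-in: `copy_to_guest_sim` merely imitates the bounds-checked behaviour of a guest copy routine on a flat buffer, and `fake_port_read` is a dummy device used only so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t (*port_read_fn)(uint16_t port, unsigned bytes);

/* Hypothetical stand-in for a guest copy: returns bytes NOT copied. */
static size_t copy_to_guest_sim(uint8_t *guest_mem, size_t guest_size,
                                size_t offset, const void *src, size_t len)
{
    if (offset > guest_size || len > guest_size - offset)
        return len;               /* would fault in a real guest */
    memcpy(guest_mem + offset, src, len);
    return 0;
}

/* Emulate REP INS: one port read plus one guest copy per element.
 * 'df' mirrors the direction flag. Returns the elements transferred. */
static size_t emulate_rep_ins(port_read_fn read_port, uint16_t port,
                              unsigned bytes, size_t count, int df,
                              uint8_t *guest_mem, size_t guest_size,
                              size_t offset)
{
    size_t done;

    assert(read_port != NULL && guest_mem != NULL);
    assert(bytes == 1 || bytes == 2 || bytes == 4);

    for (done = 0; done < count; done++) {
        uint32_t data = read_port(port, bytes);  /* the single IN */
        uint8_t tmp[4];

        for (unsigned i = 0; i < bytes; i++)     /* x86 byte order */
            tmp[i] = (uint8_t)(data >> (8 * i));

        /* Note: on failure the value already read from the port is lost;
         * a real emulator would have to report the fault to the guest. */
        if (copy_to_guest_sim(guest_mem, guest_size, offset, tmp, bytes))
            break;
        offset = df ? offset - bytes : offset + bytes;
    }
    return done;
}

/* Dummy device: returns an incrementing counter truncated to the width. */
static uint32_t fake_next;
static uint32_t fake_port_read(uint16_t port, unsigned bytes)
{
    uint32_t v = fake_next++;
    (void)port;
    return bytes == 4 ? v : (v & ((1u << (8 * bytes)) - 1));
}
```

Each element passes through a temporary (in a real emulator, through %eax), which is why the question of whether the SMM code tolerates a clobbered %eax matters for the OUTS direction.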

> Also, I might need a little more clarification on the stack (ab)use for
> creating stubs.

Just declare a char-array automatic variable, fill it with machine code, and
call it.

-- Keir



Re: [PATCH] enable port accesses with (almost) full register context
>> Also, I might need a little more clarification on the stack (ab)use for
>> creating stubs.
>
>Just declare a char-array automatic variable, fill it with machine code, and
>call it.

Actually, I rather wanted to do static setup as much as possible and hence
leave only the filling of the actual opcode to be done dynamically (at the
price of inserting one or two nops).

Jan

Re: [PATCH] enable port accesses with (almost) full register context
On 18/9/06 12:36, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Just declare a char-array automatic variable, fill it with machine code, and
>> call it.
>
> Actually, I rather wanted to do static setup as much as possible and hence
> leave only the filling of the actual opcode to be done dynamically (at the
> price of inserting one or two nops).

I think putting the static code in assembly functions and calling out to
them from a dynamically-generated stub of machine code would be neatest.

It doesn't take much C code to generate:
call prologue; in/out; jmp epilogue

That's only about 12 bytes of generated code (assuming call/jmp rel32).
Static calls/jumps are very cheap.

You can define the prologue/epilogue functions within the same .c file
inside globally-defined asm() blocks.
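
A sketch of the stub generation could look like this. It only builds the bytes for `call prologue; in/out; jmp epilogue` and does not execute them. The function names are hypothetical, and the stub's eventual address is passed separately from the buffer so that the rel32 displacements can be computed (and checked) independently of where the bytes are staged. The encodings used are the standard ones: 0xE8 for `call rel32` and 0xE9 for `jmp rel32`, with the displacement measured from the end of the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STUB_MAX 16  /* 5 (call) + up to 3 (in/out) + 5 (jmp), rounded up */

/* Emit 'opcode rel32'. Fails (-1) if the target is out of rel32 range,
 * e.g. more than 2GB away from the stub. */
static int emit_rel32(uint8_t *p, uint8_t opcode,
                      uint64_t next_insn_addr, uint64_t target)
{
    int64_t disp = (int64_t)(target - next_insn_addr);
    uint32_t d;

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;
    d = (uint32_t)(int32_t)disp;
    p[0] = opcode;
    p[1] = (uint8_t)d;
    p[2] = (uint8_t)(d >> 8);
    p[3] = (uint8_t)(d >> 16);
    p[4] = (uint8_t)(d >> 24);
    return 0;
}

/* Build: call prologue; <insn>; jmp epilogue.
 * Returns the stub length in bytes, or 0 on failure. */
static size_t build_io_stub(uint8_t *buf, uint64_t stub_addr,
                            uint64_t prologue, uint64_t epilogue,
                            const uint8_t *insn, size_t insn_len)
{
    size_t n = 0;

    assert(buf != NULL && insn != NULL);
    if (insn_len == 0 || insn_len > 3)
        return 0;

    if (emit_rel32(buf + n, 0xe8, stub_addr + n + 5, prologue))
        return 0;
    n += 5;

    memcpy(buf + n, insn, insn_len);   /* the IN or OUT itself */
    n += insn_len;

    if (emit_rel32(buf + n, 0xe9, stub_addr + n + 5, epilogue))
        return 0;
    n += 5;
    return n;
}
```

With a two-byte `out imm8, %al` in the middle this comes to 12 bytes. The range check in `emit_rel32` is the point at which the concern about a heap of 2GB or more would show up: if the stub and the static prologue/epilogue end up too far apart, rel32 no longer reaches and an indirect call would be needed instead.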

-- Keir



Re: [PATCH] enable port accesses with (almost) full register context
>>> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> 12.09.06 13:28 >>>
>On 12/9/06 11:32, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> Hm, I don't like this on-the-fly building of code very much, and I also don't
>> like writing assembly code that can obviously be written to perform better. Also,
>> on 64-bits the code wouldn't look so much nicer since there's no {push,pop}ad.
>> But certainly, if you refuse to take the patch without changing that...
>
>IMO you're doing code building anyway, but just of one instruction. You get
>rid of the locking by doing it to a per-CPU buffer, and the stack is the
>obvious place, calling out to register save/restore code. I don't really
>care about the performance of the save/restore code -- it's obviously going
>to be trivial compared with the unavoidable trap-and-emulate cost. Also, do
>you need separate save/restore code for IN vs. OUT instructions?
>
>Something like:
> call save_host_restore_guest
> <IN or OUT>
> call save_guest_restore_host
> ret
>
>Would that be reasonable?

Attaching the revised patch.

>> That sounds right (and better than the current way). I'll do that change,
>> though I guess I'd still not call it direct execution.
>
>'Special' is a crappy description because it's so non-specific. How about
>'BIOS' ports? I can't think of any reason that emulating these accesses
>could be a problem, except that BIOS/firmware is trapping them and expecting
>more context than the hardware instruction defines as being required.
>
>Alternatively, perhaps we could get rid of the distinction and emulate all
>port accesses in this way? I suspect that the cost of state save/restore and
>building the trampoline is dwarfed by the cost of the GPF and even the cost
>of the I/O port access itself (they don't tend to be super fast). Could you
>do a few quick measurements to determine this? If the extra cost is less
>than, say, 10%, I'd be inclined to take the hit to avoid interface changes.

The new measurement results (full context compared to normal emulation):

PentiumIII (32-bit)   88%
Pentium4 (64-bit)     90%

Jan