Mailing List Archive

[Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.
Hi,

We have been debugging a nasty VMX issue recently, where certain kernel
builds would fail to boot under VMX from syslinux, whereas others
booted fine and the same kernel booted OK under grub. In practice, this
means that we have only ever seen this on bootable ISO images.

The underlying problem turned out to be vmxassist mishandling the
setting up of segment registers on entry into VM86_PROTECTED mode. The
bug occurs when the guest is entering protected mode and reaches the far
jump to set up the 32-bit segment registers for the virtual VMENTER
we're about to perform. vmxassist, in protected_mode(), looks at the
segment registers, which might be 16-bit segments or might be 32-bit
segment selectors at this stage (depending on whether the OS has
reloaded them since entering protected mode); and it tries to load the
segments from the GDT. Unconditionally. Even if the segment register
still actually contains a real mode 16-bit segment. Whoops.


Now, enter the *second* bug, this time in the main Linux kernel itself.
On x86_64, the kernel boot sequence sets up a bogus GDT:
arch/x86_64/boot/setup.S has

gdt_48:
.word 0x8000 # gdt limit=2048,
# 256 GDT entries

.word 0, 0 # gdt base (filled in later)

which shows that we think we have a 2048-byte GDT, with 256 entries
(that all adds up fine); but 2048 is 0x800, not 0x8000. So when we do
an lgdt to set this gdt up, we are making it 16 times too big.


Unfortunately, when we enter this code from syslinux, SS has a 16-bit
value still, 0x3000. That should be way off the end of the GDT and
hence illegal as a descriptor even if we mistakenly tried to load it,
but because the GDT has been loaded with the wrong size, vmxassist
thinks that 0x3000 *IS* a valid segment descriptor, and loads it into
the VMCS for the guest's protected mode VMENTER.

And so, if, by chance, the 8 bytes at (GDT+0x3000) in the kernel image
pass the VMENTER instruction's simple sanity tests, we ignore the
problem and shortly afterwards the kernel will load up a valid SS; but
if we fail those sanity tests then the VMENTER fails with "bad guest
state". It's just luck whether a given vmlinuz works or not.


The reason that some kernels show this problem and others do not under
Xen is because of the 0x8000 GDT-size kernel bug. But the blame lies
squarely with Xen, because the kernel has never loaded any segments from
the undefined GDT area above 0x800, and yet vmxassist tried to set up a
VMCS segment from it anyway.

So, while we would still like to fix the kernel GDT, the *real* problem
here is in vmxassist's mishandling of segments during the move to
protected mode.


Now, vmxassist already has code to detect 16-bit segments that survived
unmodified from a transition into and out of protected mode, and to save
and restore those appropriately. It does this using "saved_rm_regs",
which get cleared on entry to protected mode, and then set to the old
segment value if we fail to set a given 32-bit segment correctly.

The fix is to save the 16-bit segments *always*, on entry to protected
mode when %CR0(PE) is first set; and to clear the saved 16-bit segment
and set the 32-bit variant in oldctx whenever a 32-bit segment
descriptor is set during the transition to 32-bit CS. Then, when we
finally do the VMENTER, we will set up the VMCS from only the 32-bit
segments, clearing the VMCS entries for segments that have not been
assigned valid 32-bit segments yet.

Tested on various RHEL-5 boot.isos, including ones which worked before
and ones which triggered the bug; all now boot correctly.

Signed-off-by: Stephen Tweedie <sct@redhat.com>
Re: [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode. [ In reply to ]
On 9/11/06 3:49 am, "Stephen C. Tweedie" <sct@redhat.com> wrote:

> The fix is to save the 16-bit segments *always*, on entry to protected
> mode when %CR0(PE) is first set; and to clear the saved 16-bit segment
> and set the 32-bit variant in oldctx whenever a 32-bit segment
> descriptor is set during the transition to 32-bit CS. Then, when we
> finally do the VMENTER, we will set up the VMCS from only the 32-bit
> segments, clearing the VMCS entries for segments that have not been
> assigned valid 32-bit segments yet.

So, after setting CR0.PE but before doing a jump to complete the transition
to protected mode, is the guest now running with zeroed data selectors but
with the old 'shadow segment state'? So it can still use its data segments
until the long jump? Not that I know whether any bootloader would actually
*want* to access data in that critical region....

-- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode. [ In reply to ]
Hi,

On Thu, 2006-11-09 at 08:56 +0000, Keir Fraser wrote:
> On 9/11/06 3:49 am, "Stephen C. Tweedie" <sct@redhat.com> wrote:
>
> > The fix is to save the 16-bit segments *always*, on entry to protected
> > mode when %CR0(PE) is first set; and to clear the saved 16-bit segment
> > and set the 32-bit variant in oldctx whenever a 32-bit segment
> > descriptor is set during the transition to 32-bit CS. Then, when we
> > finally do the VMENTER, we will set up the VMCS from only the 32-bit
> > segments, clearing the VMCS entries for segments that have not been
> > assigned valid 32-bit segments yet.
>
> So, after setting CR0.PE but before doing a jump to complete the transition
> to protected mode, is the guest now running with zeroed data selectors but
> with the old 'shadow segment state'?

I tried to change the code as little as possible to minimise
regressions, as I don't have the full battery of guest OSes to test.

But the intention at least was that the guest shouldn't change much as a
result of the patch. We're still running entirely in vmxassist
VM86_REAL_TO_PROTECTED mode during the change; the actual vmcs does not
get changed until we go to VM86_PROTECTED mode (and not at all if we
return to real mode before getting that far.)

So the guest selectors as far as the CPU is concerned are still the
selectors we set up to run vmxassist; and the selectors in effect as far
as the emulated guest OS code is concerned remains the old 16-bit
segments that vmxassist uses during that emulation.

The numerical offset that the guest sees as a selector when reading
es/ds etc. should not change, as the patch doesn't touch regs at all.

This actually highlights the fact that vmxassist's emulation for real-
to-protected mode is pretty minimal, and in fact it passes the buck to
VMENTER earlier than it should. It does so as soon as CS goes 32-bit,
but the other registers are still 16 bit and there's no theoretical
reason why a guest couldn't try to access 16-bit data/stack segments
from a 32-bit CS. I haven't touched any of that --- the logic is still
that when we drop to true 32-bit VMENTER, segments that were 16-bit are
cleared. I just changed the logic to make sure that we are better at
detecting which segments still contain 16-bit descriptors, and which
ones have been reloaded from the GDT and hence should survive into 32-
bit mode.

The fact that each segment has multiple different interpretations here
--- in the vmxassist context itself, in the guest VMCS, and in the guest
visible register set, doesn't help the discussion so I may have
misunderstood your question or been unclear in my reply!

Cheers,
Stephen



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode. [ In reply to ]
Keir Fraser wrote:
> On 9/11/06 3:49 am, "Stephen C. Tweedie" <sct@redhat.com> wrote:
>
>> The fix is to save the 16-bit segments *always*, on entry to protected
>> mode when %CR0(PE) is first set; and to clear the saved 16-bit segment
>> and set the 32-bit variant in oldctx whenever a 32-bit segment
>> descriptor is set during the transition to 32-bit CS. Then, when we
>> finally do the VMENTER, we will set up the VMCS from only the 32-bit
>> segments, clearing the VMCS entries for segments that have not been
>> assigned valid 32-bit segments yet.
>
> So, after setting CR0.PE but before doing a jump to complete the transition
> to protected mode, is the guest now running with zeroed data selectors but
> with the old 'shadow segment state'? So it can still use its data segments
> until the long jump? Not that I know whether any bootloader would actually
> *want* to access data in that critical region....
>
>

Keir,

oldctx.*_sel is basically the shadow selector. They are not used in
emulating the code during the single step. So zeroing them out is fine.
The regs structure is what is used in calculating getting data in the
data segments.

The main thing here that Stephen's approach is different to the previous
approach, is that if a segment descriptor *is* changed (ie. %es for
Vista) then we have to update the oldctx explicitly for that selector.
So far the only place we see this is with the "pop %es". And Stephen
put in the proper code to update that descriptor.

The old approach would just update all the selectors if they fit inside
the GDT table, whether or not they were changed or loaded. So when Linux
had a bad (extremely large) GDT limit, we loaded all the selectors into
the shadow selector, even though they were bogus.

-- Steve


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode. [ In reply to ]
Keir Fraser wrote:
> On 9/11/06 12:56, "Steven Rostedt" <srostedt@redhat.com> wrote:
>
>> The old approach would just update all the selectors if they fit inside
>> the GDT table, whether or not they were changed or loaded. So when Linux
>> had a bad (extremely large) GDT limit, we loaded all the selectors into
>> the shadow selector, even though they were bogus.
>
> A smaller patch might have been to implement those VMENTER sanity checks
> inside load_seg(). I'm sure they're quite simple. However, this new approach
> is arguably saner than the old, and it's worked on a few images I've tested
> it with, so I checked it in and we'll see how it holds up under wider
> testing.
>

I believed Stephen looked at the sanity check approach, and determined
that the code to simulate the sanity checks would be quite complex and
error prone. But Stephen can correct me if I'm wrong.

-- Steve



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode. [ In reply to ]
Hi,

On Thu, 2006-11-09 at 14:08 +0000, Keir Fraser wrote:

> A smaller patch might have been to implement those VMENTER sanity checks
> inside load_seg(). I'm sure they're quite simple.

I checked, as that was going to be my first approach. Turns out that
they are not actually all that simple, with all sorts of different rules
for base, size, access rights, type, flags and so on, dependent on which
segment it is, whether it's 16 or 32-bit, etc.

They aren't _too_ tricky, but after looking at it some, there are enough
rules there that I was actually more worried about introducing a
regression going down that route than in the patch I ended up with.

Cheers,
Stephen



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel