Mailing List Archive: [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.

Nov 8, 2006, 8:49 PM

Hi,

We have been debugging a nasty VMX issue recently, where certain kernel
builds would fail to boot under VMX from syslinux, whereas other builds
booted fine, and the same failing kernel booted OK under grub. In
practice, this means that we have only ever seen the failure on bootable
ISO images.

The underlying problem turned out to be vmxassist mishandling the
setting up of segment registers on entry into VM86_PROTECTED mode. The
bug occurs when the guest is entering protected mode and reaches the far
jump to set up the 32-bit segment registers for the virtual VMENTER
we're about to perform. vmxassist, in protected_mode(), looks at the
segment registers, which might be 16-bit segments or might be 32-bit
segment selectors at this stage (depending on whether the OS has
reloaded them since entering protected mode); and it tries to load the
segments from the GDT. Unconditionally. Even if the segment register
still actually contains a real mode 16-bit segment. Whoops.

Now, enter the *second* bug, this time in the main Linux kernel itself.
On x86_64, the kernel boot sequence sets up a bogus GDT:
arch/x86_64/boot/setup.S has

gdt_48:
        .word   0x8000                  # gdt limit=2048,
                                        #  256 GDT entries

        .word   0, 0                    # gdt base (filled in later)

which shows that we think we have a 2048-byte GDT, with 256 entries
(that all adds up fine); but 2048 is 0x800, not 0x8000. So when we do
an lgdt to set this gdt up, we are making it 16 times too big.
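
The arithmetic can be checked with a small sketch. The helper names below
(gdt_entries, gdt_oversize_factor) are hypothetical and exist only for
illustration; the only assumption is the standard 8-byte x86 descriptor
size.

```c
#include <assert.h>
#include <stdint.h>

#define GDT_ENTRY_SIZE 8u  /* each x86 segment descriptor is 8 bytes */

/* Number of 8-byte descriptors that fit in a GDT of the given byte size. */
static uint32_t gdt_entries(uint32_t size_bytes)
{
    assert(size_bytes % GDT_ENTRY_SIZE == 0);
    return size_bytes / GDT_ENTRY_SIZE;
}

/* How many times larger the declared size is than the intended size. */
static uint32_t gdt_oversize_factor(uint32_t declared, uint32_t intended)
{
    assert(intended != 0);
    return declared / intended;
}
```

Here gdt_entries(0x800) gives the 256 entries the comment in setup.S
intends, while gdt_entries(0x8000) gives 4096, and
gdt_oversize_factor(0x8000, 0x800) is the factor of 16 mentioned above.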

Unfortunately, when we enter this code from syslinux, SS still holds a
16-bit real-mode value, 0x3000. That should be way off the end of the
GDT and hence illegal as a selector even if we mistakenly tried to load
it, but because the GDT has been loaded with the wrong size, vmxassist
thinks that 0x3000 *IS* a valid segment selector, and loads the
descriptor it points at into the VMCS for the guest's protected mode
VMENTER.
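
To illustrate why the wrong limit matters, here is a minimal sketch of a
selector bounds check. It is not the vmxassist code; selector_offset and
selector_in_gdt are hypothetical names, and the check assumes the usual
hardware convention that the limit is the last valid byte offset in the
table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte offset of a selector's descriptor: mask off RPL (bits 0-1)
 * and the table-indicator bit (bit 2). */
static uint32_t selector_offset(uint16_t sel)
{
    return (uint32_t)sel & ~0x7u;
}

/* True if all 8 bytes of the descriptor lie within the GDT limit. */
static bool selector_in_gdt(uint16_t sel, uint32_t gdt_limit)
{
    return selector_offset(sel) + 7u <= gdt_limit;
}
```

With the intended limit, selector_in_gdt(0x3000, 0x800) is false, so a
stale real-mode SS of 0x3000 would be rejected. With the limit actually
loaded, selector_in_gdt(0x3000, 0x8000) is true, and whatever 8 bytes sit
at GDT+0x3000 get treated as a descriptor.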

And so, if, by chance, the 8 bytes at (GDT+0x3000) in the kernel image
pass the VMENTER instruction's simple sanity tests, we ignore the
problem and shortly afterwards the kernel will load up a valid SS; but
if we fail those sanity tests then the VMENTER fails with "bad guest
state". It's just luck whether a given vmlinuz works or not.

Some kernels show this problem under Xen and others do not because of
the 0x8000 GDT-size kernel bug. But the blame lies squarely with Xen,
because the kernel has never loaded any segments from the undefined GDT
area above 0x800, and yet vmxassist tried to set up a VMCS segment from
it anyway.

So, while we would still like to fix the kernel GDT, the *real* problem
here is in vmxassist's mishandling of segments during the move to
protected mode.

Now, vmxassist already has code to detect 16-bit segments that survived
unmodified from a transition into and out of protected mode, and to save
and restore those appropriately. It does this using "saved_rm_regs",
which get cleared on entry to protected mode, and then set to the old
segment value if we fail to set a given 32-bit segment correctly.

The fix is to save the 16-bit segments *always*, on entry to protected
mode when %CR0(PE) is first set; and to clear the saved 16-bit segment
and set the 32-bit variant in oldctx whenever a 32-bit segment
descriptor is set during the transition to 32-bit CS. Then, when we
finally do the VMENTER, we will set up the VMCS from only the 32-bit
segments, clearing the VMCS entries for segments that have not been
assigned valid 32-bit segments yet.
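
The bookkeeping described above can be sketched roughly as follows. This
is an illustration only, not the patch itself: struct seg_state and the
functions on_pe_set, on_pm_load and vmcs_selector are hypothetical names
that do not appear in vmxassist, and the sketch tracks only selectors,
not full descriptors.

```c
#include <stdbool.h>
#include <stdint.h>

enum seg { SEG_ES, SEG_SS, SEG_DS, SEG_FS, SEG_GS, SEG_COUNT };

struct seg_state {
    uint16_t saved_rm[SEG_COUNT];   /* 16-bit value saved when PE is set */
    bool     rm_pending[SEG_COUNT]; /* still holds the real-mode value   */
    uint16_t pm_sel[SEG_COUNT];     /* 32-bit selector, once loaded      */
};

/* CR0.PE first set: always save every 16-bit segment value. */
static void on_pe_set(struct seg_state *s, const uint16_t cur[SEG_COUNT])
{
    for (int i = 0; i < SEG_COUNT; i++) {
        s->saved_rm[i] = cur[i];
        s->rm_pending[i] = true;
        s->pm_sel[i] = 0;
    }
}

/* Guest loads a 32-bit selector: drop the saved 16-bit value. */
static void on_pm_load(struct seg_state *s, enum seg r, uint16_t sel)
{
    s->saved_rm[r] = 0;
    s->rm_pending[r] = false;
    s->pm_sel[r] = sel;
}

/* Value to place in the VMCS: the 32-bit selector if one was loaded,
 * otherwise a cleared (null) entry. The GDT is never consulted for a
 * stale real-mode value. */
static uint16_t vmcs_selector(const struct seg_state *s, enum seg r)
{
    return s->rm_pending[r] ? 0 : s->pm_sel[r];
}
```

In this sketch, an SS of 0x3000 left over from syslinux yields a cleared
VMCS entry until the guest reloads SS with a real protected-mode
selector, regardless of what limit the GDT was loaded with.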

Tested on various RHEL-5 boot.isos, including ones which worked before
and ones which triggered the bug; all now boot correctly.

Signed-off-by: Stephen Tweedie <sct@redhat.com>