Mailing List Archive

[PATCH] x86/AMD: make HT range dynamic for Fam17 and up
At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
address range") documentation correctly stated that the range was
completely fixed. For Fam17 and newer, it lives at the top of physical
address space, though.

To correctly determine the top of physical address space, we need to
account for their physical address reduction, hence the calculation of
paddr_bits also gets adjusted.

While for paddr_bits < 40 the HT range is completely hidden, there's no
need to suppress the range insertion in that case: It'll just have no
real meaning.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -349,13 +349,17 @@ void __init early_cpu_init(void)

eax = cpuid_eax(0x80000000);
if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
+ ebx = eax >= 0x8000001f ? cpuid_ebx(0x8000001f) : 0;
eax = cpuid_eax(0x80000008);
- paddr_bits = eax & 0xff;
+
+ paddr_bits = (eax & 0xff) - ((ebx >> 6) & 0x3f);
if (paddr_bits > PADDR_BITS)
paddr_bits = PADDR_BITS;
+
vaddr_bits = (eax >> 8) & 0xff;
if (vaddr_bits > VADDR_BITS)
vaddr_bits = VADDR_BITS;
+
hap_paddr_bits = ((eax >> 16) & 0xff) ?: paddr_bits;
if (hap_paddr_bits > PADDR_BITS)
hap_paddr_bits = PADDR_BITS;
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -524,8 +524,11 @@ int __init dom0_setup_permissions(struct
MSI_ADDR_DEST_ID_MASK));
/* HyperTransport range. */
if ( boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
- rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
- paddr_to_pfn((1ULL << 40) - 1));
+ {
+ mfn = paddr_to_pfn(1UL <<
+ (boot_cpu_data.x86 < 0x17 ? 40 : paddr_bits));
+ rc |= iomem_deny_access(d, mfn - paddr_to_pfn(3UL << 32), mfn - 1);
+ }

/* Remove access to E820_UNUSABLE I/O regions above 1MB. */
for ( i = 0; i < e820.nr_map; i++ )
Re: [PATCH] x86/AMD: make HT range dynamic for Fam17 and up [ In reply to ]
On 18/06/2021 17:00, Jan Beulich wrote:
> At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
> address range") documentation correctly stated that the range was
> completely fixed. For Fam17 and newer, it lives at the top of physical
> address space, though.
>
> To correctly determine the top of physical address space, we need to
> account for their physical address reduction, hence the calculation of
> paddr_bits also gets adjusted.
>
> While for paddr_bits < 40 the HT range is completely hidden, there's no
> need to suppress the range insertion in that case: It'll just have no
> real meaning.
>
> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>

Really, this ought to be reported by Igor.  He did all the hard work.

Also, I'd like to get something more formal out of AMD if possible.

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -349,13 +349,17 @@ void __init early_cpu_init(void)
>
> eax = cpuid_eax(0x80000000);
> if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
> + ebx = eax >= 0x8000001f ? cpuid_ebx(0x8000001f) : 0;
> eax = cpuid_eax(0x80000008);
> - paddr_bits = eax & 0xff;
> +
> + paddr_bits = (eax & 0xff) - ((ebx >> 6) & 0x3f);

While I can see the attraction of editing paddr_bits, I think it will
break the emulated pagewalk (at least).

With SME active, the address space reduction is only physical addresses
only, not guest physical.  An HVM guest still needs to see the original
paddr_bits, and the emulated pagewalk needs to use this for reserved bit
calculations.

i.e. under NPT, you can still have a working 2^48 guest physical address
space despite the host physical address space is limited to 2^43 by
encryption being active.

~Andrew
Re: [PATCH] x86/AMD: make HT range dynamic for Fam17 and up [ In reply to ]
On 18/06/2021 17:00, Jan Beulich wrote:
> At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
> address range") documentation correctly stated that the range was
> completely fixed. For Fam17 and newer, it lives at the top of physical
> address space, though.

From "Open-Source Register Reference for AMD Family 17h Processors (PUB)":
https://developer.amd.com/wp-content/resources/56255_3_03.PDF

"The processor defines a reserved memory address region starting at
FFFD_0000_0000h and extending up to FFFF_FFFF_FFFFh."

It's still doesn't say that it's at the top of physical address space
although I understand that's how it's now implemented. The official
document doesn't confirm it will move along with physical address space
extension.

> To correctly determine the top of physical address space, we need to
> account for their physical address reduction, hence the calculation of
> paddr_bits also gets adjusted.
>
> While for paddr_bits < 40 the HT range is completely hidden, there's no
> need to suppress the range insertion in that case: It'll just have no
> real meaning.
>
> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -349,13 +349,17 @@ void __init early_cpu_init(void)
>
> eax = cpuid_eax(0x80000000);
> if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
> + ebx = eax >= 0x8000001f ? cpuid_ebx(0x8000001f) : 0;
> eax = cpuid_eax(0x80000008);
> - paddr_bits = eax & 0xff;
> +

I understand Andrew has some concerns regarding changing paddr_bits but
some comment explaining what's located at 0x8000001f:ebx[11:6] and why
we're doing this might be useful.

> + paddr_bits = (eax & 0xff) - ((ebx >> 6) & 0x3f);
> if (paddr_bits > PADDR_BITS)
> paddr_bits = PADDR_BITS;
> +
> vaddr_bits = (eax >> 8) & 0xff;
> if (vaddr_bits > VADDR_BITS)
> vaddr_bits = VADDR_BITS;
> +
> hap_paddr_bits = ((eax >> 16) & 0xff) ?: paddr_bits;
> if (hap_paddr_bits > PADDR_BITS)
> hap_paddr_bits = PADDR_BITS;
> --- a/xen/arch/x86/dom0_build.c
> +++ b/xen/arch/x86/dom0_build.c
> @@ -524,8 +524,11 @@ int __init dom0_setup_permissions(struct
> MSI_ADDR_DEST_ID_MASK));
> /* HyperTransport range. */
> if ( boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
> - rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
> - paddr_to_pfn((1ULL << 40) - 1));
> + {
> + mfn = paddr_to_pfn(1UL <<
> + (boot_cpu_data.x86 < 0x17 ? 40 : paddr_bits));

That doesn't really follow what Andrew gave us, namely:

1) On parts with <40 bits, its fully hidden from software
2) Before Fam17h, it was always 12G just below 1T, even if there was more RAM above this location
3) On Fam17h and later, it is variable based on SME, and is either just below 2^48 (no encryption) or 2^43 (encryption)

Do we need (1) to be coded here as well?

Igor
Re: [PATCH] x86/AMD: make HT range dynamic for Fam17 and up [ In reply to ]
On 18/06/2021 18:15, Igor Druzhinin wrote:
> On 18/06/2021 17:00, Jan Beulich wrote:
>> At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
>> address range") documentation correctly stated that the range was
>> completely fixed. For Fam17 and newer, it lives at the top of physical
>> address space, though.
>
> From "Open-Source Register Reference for AMD Family 17h Processors (PUB)":
> https://developer.amd.com/wp-content/resources/56255_3_03.PDF
>
> "The processor defines a reserved memory address region starting at
> FFFD_0000_0000h and extending up to FFFF_FFFF_FFFFh."
>
> It's still doesn't say that it's at the top of physical address space
> although I understand that's how it's now implemented. The official
> document doesn't confirm it will move along with physical address space
> extension.
>
>> To correctly determine the top of physical address space, we need to
>> account for their physical address reduction, hence the calculation of
>> paddr_bits also gets adjusted.
>>
>> While for paddr_bits < 40 the HT range is completely hidden, there's no
>> need to suppress the range insertion in that case: It'll just have no
>> real meaning.
>>
>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/arch/x86/cpu/common.c
>> +++ b/xen/arch/x86/cpu/common.c
>> @@ -349,13 +349,17 @@ void __init early_cpu_init(void)
>>       eax = cpuid_eax(0x80000000);
>>       if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
>> +        ebx = eax >= 0x8000001f ? cpuid_ebx(0x8000001f) : 0;
>>           eax = cpuid_eax(0x80000008);
>> -        paddr_bits = eax & 0xff;
>> +
>
> I understand Andrew has some concerns regarding changing paddr_bits but
> some comment explaining what's located at 0x8000001f:ebx[11:6] and why
> we're doing this might be useful.
>
>> +        paddr_bits = (eax & 0xff) - ((ebx >> 6) & 0x3f);
>>           if (paddr_bits > PADDR_BITS)
>>               paddr_bits = PADDR_BITS;
>> +
>>           vaddr_bits = (eax >> 8) & 0xff;
>>           if (vaddr_bits > VADDR_BITS)
>>               vaddr_bits = VADDR_BITS;
>> +
>>           hap_paddr_bits = ((eax >> 16) & 0xff) ?: paddr_bits;
>>           if (hap_paddr_bits > PADDR_BITS)
>>               hap_paddr_bits = PADDR_BITS;
>> --- a/xen/arch/x86/dom0_build.c
>> +++ b/xen/arch/x86/dom0_build.c
>> @@ -524,8 +524,11 @@ int __init dom0_setup_permissions(struct
>>                                            MSI_ADDR_DEST_ID_MASK));
>>       /* HyperTransport range. */
>>       if ( boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
>> -        rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
>> -                                paddr_to_pfn((1ULL << 40) - 1));
>> +    {
>> +        mfn = paddr_to_pfn(1UL <<
>> +                           (boot_cpu_data.x86 < 0x17 ? 40 : paddr_bits));
>
> That doesn't really follow what Andrew gave us, namely:
>
> 1) On parts with <40 bits, its fully hidden from software
> 2) Before Fam17h, it was always 12G just below 1T, even if there was more RAM above this location
> 3) On Fam17h and later, it is variable based on SME, and is either just below 2^48 (no encryption) or 2^43 (encryption)
>
> Do we need (1) to be coded here as well?

Ignore this last paragraph - I lost your statement in comment description.

Igor
Re: [PATCH] x86/AMD: make HT range dynamic for Fam17 and up [ In reply to ]
On 18.06.2021 18:32, Andrew Cooper wrote:
> On 18/06/2021 17:00, Jan Beulich wrote:
>> At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
>> address range") documentation correctly stated that the range was
>> completely fixed. For Fam17 and newer, it lives at the top of physical
>> address space, though.
>>
>> To correctly determine the top of physical address space, we need to
>> account for their physical address reduction, hence the calculation of
>> paddr_bits also gets adjusted.
>>
>> While for paddr_bits < 40 the HT range is completely hidden, there's no
>> need to suppress the range insertion in that case: It'll just have no
>> real meaning.
>>
>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> Really, this ought to be reported by Igor.  He did all the hard work.

Sure, changed.

>> --- a/xen/arch/x86/cpu/common.c
>> +++ b/xen/arch/x86/cpu/common.c
>> @@ -349,13 +349,17 @@ void __init early_cpu_init(void)
>>
>> eax = cpuid_eax(0x80000000);
>> if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
>> + ebx = eax >= 0x8000001f ? cpuid_ebx(0x8000001f) : 0;
>> eax = cpuid_eax(0x80000008);
>> - paddr_bits = eax & 0xff;
>> +
>> + paddr_bits = (eax & 0xff) - ((ebx >> 6) & 0x3f);
>
> While I can see the attraction of editing paddr_bits, I think it will
> break the emulated pagewalk (at least).
>
> With SME active, the address space reduction is only physical addresses
> only, not guest physical.  An HVM guest still needs to see the original
> paddr_bits, and the emulated pagewalk needs to use this for reserved bit
> calculations.
>
> i.e. under NPT, you can still have a working 2^48 guest physical address
> space despite the host physical address space is limited to 2^43 by
> encryption being active.

Which means we may need to split the variable into paddr_bits
(calculated as above) and gaddr_bits (left at what paddr_bits has
been so far). However, isn't that what hap_paddr_bits is already
for, while for shadow mode it still needs to be the "adjusted" way?
We'd then simply not fall back to the "adjusted" paddr_bits, but to
the original one.

Another aspect I was wondering about is whether

if (paddr_bits > PADDR_BITS)
paddr_bits = PADDR_BITS;

should apply before or after subtracting the value from
80000008.EBX. I was first inclined to make it the more relaxed
way (applying before reduction), but then thought I'd first leave
it as is now, on the basis that the PTE layout doesn't change, and
hence 52 remains the limit for the full address.

Jan
Re: [PATCH] x86/AMD: make HT range dynamic for Fam17 and up [ In reply to ]
On 21.06.2021 08:18, Jan Beulich wrote:
> Another aspect I was wondering about is whether
>
> if (paddr_bits > PADDR_BITS)
> paddr_bits = PADDR_BITS;
>
> should apply before or after subtracting the value from
> 80000008.EBX. I was first inclined to make it the more relaxed
> way (applying before reduction), but then thought I'd first leave
> it as is now, on the basis that the PTE layout doesn't change, and
> hence 52 remains the limit for the full address.

Actually that was the wrong way round - the PTE layout argument
suggests I should cap first, then subtract. Which will also simplify
what needs doing for hap_paddr_bits.

Jan
Re: [PATCH] x86/AMD: make HT range dynamic for Fam17 and up [ In reply to ]
On 21.06.2021 08:18, Jan Beulich wrote:
> On 18.06.2021 18:32, Andrew Cooper wrote:
>> On 18/06/2021 17:00, Jan Beulich wrote:
>>> At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
>>> address range") documentation correctly stated that the range was
>>> completely fixed. For Fam17 and newer, it lives at the top of physical
>>> address space, though.
>>>
>>> To correctly determine the top of physical address space, we need to
>>> account for their physical address reduction, hence the calculation of
>>> paddr_bits also gets adjusted.
>>>
>>> While for paddr_bits < 40 the HT range is completely hidden, there's no
>>> need to suppress the range insertion in that case: It'll just have no
>>> real meaning.
>>>
>>> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>> Really, this ought to be reported by Igor.  He did all the hard work.
>
> Sure, changed.
>
>>> --- a/xen/arch/x86/cpu/common.c
>>> +++ b/xen/arch/x86/cpu/common.c
>>> @@ -349,13 +349,17 @@ void __init early_cpu_init(void)
>>>
>>> eax = cpuid_eax(0x80000000);
>>> if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
>>> + ebx = eax >= 0x8000001f ? cpuid_ebx(0x8000001f) : 0;
>>> eax = cpuid_eax(0x80000008);
>>> - paddr_bits = eax & 0xff;
>>> +
>>> + paddr_bits = (eax & 0xff) - ((ebx >> 6) & 0x3f);
>>
>> While I can see the attraction of editing paddr_bits, I think it will
>> break the emulated pagewalk (at least).
>>
>> With SME active, the address space reduction is only physical addresses
>> only, not guest physical.  An HVM guest still needs to see the original
>> paddr_bits, and the emulated pagewalk needs to use this for reserved bit
>> calculations.
>>
>> i.e. under NPT, you can still have a working 2^48 guest physical address
>> space despite the host physical address space is limited to 2^43 by
>> encryption being active.
>
> Which means we may need to split the variable into paddr_bits
> (calculated as above) and gaddr_bits (left at what paddr_bits has
> been so far). However, isn't that what hap_paddr_bits is already
> for, while for shadow mode it still needs to be the "adjusted" way?
> We'd then simply not fall back to the "adjusted" paddr_bits, but to
> the original one.
>
> Another aspect I was wondering about is whether
>
> if (paddr_bits > PADDR_BITS)
> paddr_bits = PADDR_BITS;
>
> should apply before or after subtracting the value from
> 80000008.EBX.

And of course this was meant to be 8000001F.EBX.

Jan

> I was first inclined to make it the more relaxed
> way (applying before reduction), but then thought I'd first leave
> it as is now, on the basis that the PTE layout doesn't change, and
> hence 52 remains the limit for the full address.
>
> Jan
>
>