
[PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
Add a "never" option to the nx_huge_pages module param to allow userspace
to do a one-way hard disabling of the mitigation, and don't create the
per-VM recovery threads when the mitigation is hard disabled. Letting
userspace pinky swear that userspace doesn't want to enable NX mitigation
(without reloading KVM) allows certain use cases to avoid the latency
problems associated with spawning a kthread for each VM.

E.g. in FaaS use cases, the guest kernel is trusted and the host may
create 100+ VMs per logical CPU, which can result in 100ms+ latencies when
a burst of VMs is created.

Reported-by: Li RongQing <lirongqing@baidu.com>
Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@baidu.com
Cc: Yong He <zhuangel570@gmail.com>
Cc: Robert Hoo <robert.hoo.linux@gmail.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 41 ++++++++++++++++++++++++++++++++++++-----
1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8961f45e3b1..2ed38916b904 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -58,6 +58,8 @@

extern bool itlb_multihit_kvm_mitigation;

+static bool nx_hugepage_mitigation_hard_disabled;
+
int __read_mostly nx_huge_pages = -1;
static uint __read_mostly nx_huge_pages_recovery_period_ms;
#ifdef CONFIG_PREEMPT_RT
@@ -67,12 +69,13 @@ static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
#endif

+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp);
static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp);

static const struct kernel_param_ops nx_huge_pages_ops = {
.set = set_nx_huge_pages,
- .get = param_get_bool,
+ .get = get_nx_huge_pages,
};

static const struct kernel_param_ops nx_huge_pages_recovery_param_ops = {
@@ -6844,6 +6847,14 @@ static void mmu_destroy_caches(void)
kmem_cache_destroy(mmu_page_header_cache);
}

+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
+{
+ if (nx_hugepage_mitigation_hard_disabled)
+ return sprintf(buffer, "never\n");
+
+ return param_get_bool(buffer, kp);
+}
+
static bool get_nx_auto_mode(void)
{
/* Return true when CPU has the bug, and mitigations are ON */
@@ -6860,15 +6871,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
bool old_val = nx_huge_pages;
bool new_val;

+ if (nx_hugepage_mitigation_hard_disabled)
+ return -EPERM;
+
/* In "auto" mode deploy workaround only if CPU has the bug. */
- if (sysfs_streq(val, "off"))
+ if (sysfs_streq(val, "off")) {
new_val = 0;
- else if (sysfs_streq(val, "force"))
+ } else if (sysfs_streq(val, "force")) {
new_val = 1;
- else if (sysfs_streq(val, "auto"))
+ } else if (sysfs_streq(val, "auto")) {
new_val = get_nx_auto_mode();
- else if (kstrtobool(val, &new_val) < 0)
+ } else if (sysfs_streq(val, "never")) {
+ new_val = 0;
+
+ mutex_lock(&kvm_lock);
+ if (!list_empty(&vm_list)) {
+ mutex_unlock(&kvm_lock);
+ return -EBUSY;
+ }
+ nx_hugepage_mitigation_hard_disabled = true;
+ mutex_unlock(&kvm_lock);
+ } else if (kstrtobool(val, &new_val) < 0) {
return -EINVAL;
+ }

__set_nx_huge_pages(new_val);

@@ -7006,6 +7031,9 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
uint old_period, new_period;
int err;

+ if (nx_hugepage_mitigation_hard_disabled)
+ return -EPERM;
+
was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period);

err = param_set_uint(val, kp);
@@ -7161,6 +7189,9 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
{
int err;

+ if (nx_hugepage_mitigation_hard_disabled)
+ return 0;
+
err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
"kvm-nx-lpage-recovery",
&kvm->arch.nx_huge_page_recovery_thread);

base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
--
2.41.0.rc2.161.g9c6817b8e7-goog
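
For illustration, a minimal userspace sketch of the intended semantics follows.
This is hypothetical code (the helper name is made up); it assumes kvm.ko is
loaded, that no VMs have been created yet, and that the module param is reachable
at its usual sysfs path:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int set_nx_mode(const char *val)
{
        int fd = open("/sys/module/kvm/parameters/nx_huge_pages", O_WRONLY);
        ssize_t ret;

        if (fd < 0)
                return -errno;
        ret = write(fd, val, strlen(val));
        close(fd);
        return ret < 0 ? -errno : 0;
}

int main(void)
{
        /* One-way hard disable; expected to fail with -EBUSY if a VM already exists. */
        printf("never: %d\n", set_nx_mode("never"));
        /* Once hard disabled, switching modes should fail with -EPERM. */
        printf("force: %d\n", set_nx_mode("force"));
        return 0;
}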
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On 6/2/2023 8:58 AM, Sean Christopherson wrote:
> Add a "never" option to the nx_huge_pages module param to allow userspace
> to do a one-way hard disabling of the mitigation, and don't create the
> per-VM recovery threads when the mitigation is hard disabled. Letting
> userspace pinky swear that userspace doesn't want to enable NX mitigation
> (without reloading KVM) allows certain use cases to avoid the latency
> problems associated with spawning a kthread for each VM.
>
> E.g. in FaaS use cases, the guest kernel is trusted and the host may
> create 100+ VMs per logical CPU, which can result in 100ms+ latencies when
> a burst of VMs is created.
>
> Reported-by: Li RongQing <lirongqing@baidu.com>
> Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@baidu.com
> Cc: Yong He <zhuangel570@gmail.com>
> Cc: Robert Hoo <robert.hoo.linux@gmail.com>
> Cc: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 41 ++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 36 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8961f45e3b1..2ed38916b904 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -58,6 +58,8 @@
>
> extern bool itlb_multihit_kvm_mitigation;
>
> +static bool nx_hugepage_mitigation_hard_disabled;
> +
> int __read_mostly nx_huge_pages = -1;
> static uint __read_mostly nx_huge_pages_recovery_period_ms;
> #ifdef CONFIG_PREEMPT_RT
> @@ -67,12 +69,13 @@ static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
> static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
> #endif
>
> +static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp);
> static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
> static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp);
>
> static const struct kernel_param_ops nx_huge_pages_ops = {
> .set = set_nx_huge_pages,
> - .get = param_get_bool,
> + .get = get_nx_huge_pages,
> };
>
> static const struct kernel_param_ops nx_huge_pages_recovery_param_ops = {
> @@ -6844,6 +6847,14 @@ static void mmu_destroy_caches(void)
> kmem_cache_destroy(mmu_page_header_cache);
> }
>
> +static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
> +{
> + if (nx_hugepage_mitigation_hard_disabled)
> + return sprintf(buffer, "never\n");
> +
> + return param_get_bool(buffer, kp);
> +}
> +
> static bool get_nx_auto_mode(void)
> {
> /* Return true when CPU has the bug, and mitigations are ON */
> @@ -6860,15 +6871,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
> bool old_val = nx_huge_pages;
> bool new_val;
>
> + if (nx_hugepage_mitigation_hard_disabled)
> + return -EPERM;
> +
> /* In "auto" mode deploy workaround only if CPU has the bug. */
> - if (sysfs_streq(val, "off"))
> + if (sysfs_streq(val, "off")) {
> new_val = 0;
> - else if (sysfs_streq(val, "force"))
> + } else if (sysfs_streq(val, "force")) {
> new_val = 1;
> - else if (sysfs_streq(val, "auto"))
> + } else if (sysfs_streq(val, "auto")) {
> new_val = get_nx_auto_mode();
> - else if (kstrtobool(val, &new_val) < 0)
> + } else if (sysfs_streq(val, "never")) {
> + new_val = 0;
> +
> + mutex_lock(&kvm_lock);
> + if (!list_empty(&vm_list)) {
> + mutex_unlock(&kvm_lock);
> + return -EBUSY;
> + }
> + nx_hugepage_mitigation_hard_disabled = true;
> + mutex_unlock(&kvm_lock);
> + } else if (kstrtobool(val, &new_val) < 0) {
> return -EINVAL;
> + }
>

IIUC, if the param is initially "auto_off"/"off", some VMs are created, and it
is then switched to "never", the already-created VMs still keep those kthreads,
which can never be used until they are destroyed along with the VM.

Is this a designed/accepted pattern?

> __set_nx_huge_pages(new_val);
>
> @@ -7006,6 +7031,9 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
> uint old_period, new_period;
> int err;
>
> + if (nx_hugepage_mitigation_hard_disabled)
> + return -EPERM;
> +
> was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period);
>
> err = param_set_uint(val, kp);
> @@ -7161,6 +7189,9 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
> {
> int err;
>
> + if (nx_hugepage_mitigation_hard_disabled)
> + return 0;
> +
> err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
> "kvm-nx-lpage-recovery",
> &kvm->arch.nx_huge_page_recovery_thread);
>
> base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On Fri, Jun 02, 2023, Robert Hoo wrote:
> On 6/2/2023 8:58 AM, Sean Christopherson wrote:
> > @@ -6860,15 +6871,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
> > bool old_val = nx_huge_pages;
> > bool new_val;
> > + if (nx_hugepage_mitigation_hard_disabled)
> > + return -EPERM;
> > +
> > /* In "auto" mode deploy workaround only if CPU has the bug. */
> > - if (sysfs_streq(val, "off"))
> > + if (sysfs_streq(val, "off")) {
> > new_val = 0;
> > - else if (sysfs_streq(val, "force"))
> > + } else if (sysfs_streq(val, "force")) {
> > new_val = 1;
> > - else if (sysfs_streq(val, "auto"))
> > + } else if (sysfs_streq(val, "auto")) {
> > new_val = get_nx_auto_mode();
> > - else if (kstrtobool(val, &new_val) < 0)
> > + } else if (sysfs_streq(val, "never")) {
> > + new_val = 0;
> > +
> > + mutex_lock(&kvm_lock);
> > + if (!list_empty(&vm_list)) {
> > + mutex_unlock(&kvm_lock);
> > + return -EBUSY;
> > + }
> > + nx_hugepage_mitigation_hard_disabled = true;
> > + mutex_unlock(&kvm_lock);
> > + } else if (kstrtobool(val, &new_val) < 0) {
> > return -EINVAL;
> > + }
> >
>
> IIUC, (Initially) "auto_off"/"off" --> create some VM --> turn to "never",
> the created VMs still have those kthreads, but can never be used, until
> destroyed with VM.

Shouldn't be able to happen. The above rejects "never" if vm_list isn't empty,
i.e. if there are any VMs, and sets nx_hugepage_mitigation_hard_disabled under
kvm_lock to ensure it can't race with KVM_CREATE_VM. I forgot to call this out
in the changelog though.
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On 6/2/2023 11:09 PM, Sean Christopherson wrote:
> On Fri, Jun 02, 2023, Robert Hoo wrote:
>> On 6/2/2023 8:58 AM, Sean Christopherson wrote:
>>> @@ -6860,15 +6871,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
>>> bool old_val = nx_huge_pages;
>>> bool new_val;
>>> + if (nx_hugepage_mitigation_hard_disabled)
>>> + return -EPERM;
>>> +
>>> /* In "auto" mode deploy workaround only if CPU has the bug. */
>>> - if (sysfs_streq(val, "off"))
>>> + if (sysfs_streq(val, "off")) {
>>> new_val = 0;
>>> - else if (sysfs_streq(val, "force"))
>>> + } else if (sysfs_streq(val, "force")) {
>>> new_val = 1;
>>> - else if (sysfs_streq(val, "auto"))
>>> + } else if (sysfs_streq(val, "auto")) {
>>> new_val = get_nx_auto_mode();
>>> - else if (kstrtobool(val, &new_val) < 0)
>>> + } else if (sysfs_streq(val, "never")) {
>>> + new_val = 0;
>>> +
>>> + mutex_lock(&kvm_lock);
>>> + if (!list_empty(&vm_list)) {
>>> + mutex_unlock(&kvm_lock);
>>> + return -EBUSY;

Ah, right, thanks for explanation.

Reviewed-by: Robert Hoo <robert.hoo.linux@gmail.com>

>>> + }
>>> + nx_hugepage_mitigation_hard_disabled = true;
>>> + mutex_unlock(&kvm_lock);
>>> + } else if (kstrtobool(val, &new_val) < 0) {
>>> return -EINVAL;
>>> + }
>>>
>>
>> IIUC, (Initially) "auto_off"/"off" --> create some VM --> turn to "never",
>> the created VMs still have those kthreads, but can never be used, until
>> destroyed with VM.
>
> Shouldn't be able to happen. The above rejects "never" if vm_list isn't empty,
> i.e. if there are any VMs, and sets nx_hugepage_mitigation_hard_disabled under
> kvm_lock to ensure it can't race with KVM_CREATE_VM. I forgot to call this out
> in the changelog though.
RE: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: Friday, June 2, 2023 8:59 AM
> To: Sean Christopherson <seanjc@google.com>; Paolo Bonzini
> <pbonzini@redhat.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Li,Rongqing
> <lirongqing@baidu.com>; Yong He <zhuangel570@gmail.com>; Robert Hoo
> <robert.hoo.linux@gmail.com>; Kai Huang <kai.huang@intel.com>
> Subject: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of
> nx_huge_pages
>
> Add a "never" option to the nx_huge_pages module param to allow userspace
> to do a one-way hard disabling of the mitigation, and don't create the per-VM
> recovery threads when the mitigation is hard disabled. Letting userspace pinky
> swear that userspace doesn't want to enable NX mitigation (without reloading
> KVM) allows certain use cases to avoid the latency problems associated with
> spawning a kthread for each VM.
>
> E.g. in FaaS use cases, the guest kernel is trusted and the host may create 100+
> VMs per logical CPU, which can result in 100ms+ latencies when a burst of VMs
> is created.
>


Reviewed-by: Li RongQing <lirongqing@baidu.com>

And I hope nx_huge_pages defaults to "never" if the CPU reports that it doesn't have this bug

Thanks

-Li RongQing
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On Fri, 2023-06-02 at 08:09 -0700, Sean Christopherson wrote:
> On Fri, Jun 02, 2023, Robert Hoo wrote:
> > On 6/2/2023 8:58 AM, Sean Christopherson wrote:
> > > @@ -6860,15 +6871,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
> > > bool old_val = nx_huge_pages;
> > > bool new_val;
> > > + if (nx_hugepage_mitigation_hard_disabled)
> > > + return -EPERM;
> > > +
> > > /* In "auto" mode deploy workaround only if CPU has the bug. */
> > > - if (sysfs_streq(val, "off"))
> > > + if (sysfs_streq(val, "off")) {
> > > new_val = 0;
> > > - else if (sysfs_streq(val, "force"))
> > > + } else if (sysfs_streq(val, "force")) {
> > > new_val = 1;
> > > - else if (sysfs_streq(val, "auto"))
> > > + } else if (sysfs_streq(val, "auto")) {
> > > new_val = get_nx_auto_mode();
> > > - else if (kstrtobool(val, &new_val) < 0)
> > > + } else if (sysfs_streq(val, "never")) {
> > > + new_val = 0;
> > > +
> > > + mutex_lock(&kvm_lock);
> > > + if (!list_empty(&vm_list)) {
> > > + mutex_unlock(&kvm_lock);
> > > + return -EBUSY;
> > > + }
> > > + nx_hugepage_mitigation_hard_disabled = true;
> > > + mutex_unlock(&kvm_lock);
> > > + } else if (kstrtobool(val, &new_val) < 0) {
> > > return -EINVAL;
> > > + }
> > >
> >
> > IIUC, (Initially) "auto_off"/"off" --> create some VM --> turn to "never",
> > the created VMs still have those kthreads, but can never be used, until
> > destroyed with VM.
>
> Shouldn't be able to happen. The above rejects "never" if vm_list isn't empty,
> i.e. if there are any VMs, and sets nx_hugepage_mitigation_hard_disabled under
> kvm_lock to ensure it can't race with KVM_CREATE_VM. I forgot to call this out
> in the changelog though.

Also, IIUC, once set to "never", userspace will not be able to switch to other
modes unless it reloads KVM. Worth calling this out too?

Acked-by: Kai Huang <kai.huang@intel.com>
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On Thu, Jun 01, 2023 at 05:58:59PM -0700, Sean Christopherson wrote:
> Add a "never" option to the nx_huge_pages module param to allow userspace
> to do a one-way hard disabling of the mitigation, and don't create the
> per-VM recovery threads when the mitigation is hard disabled. Letting
> userspace pinky swear that userspace doesn't want to enable NX mitigation
> (without reloading KVM) allows certain use cases to avoid the latency
> problems associated with spawning a kthread for each VM.
>
> E.g. in FaaS use cases, the guest kernel is trusted and the host may
> create 100+ VMs per logical CPU, which can result in 100ms+ latencies when
> a burst of VMs is created.

Tested-by: Luiz Capitulino <luizcap@amazon.com>

Without this patch I can see the 100ms+ latencies on KVM_CREATE_VM even
with a single VM.

Just run a VM with strace -T and grep for KVM_CREATE_VM. When using
kvmtool I get (latency in seconds - kernel HEAD is a4d7d70112):

ioctl(3, KVM_CREATE_VM, 0) = 4 <0.023567>

ioctl(3, KVM_CREATE_VM, 0) = 4 <0.076709>

ioctl(3, KVM_CREATE_VM, 0) = 4 <0.109109>

With this patch and nx_huge_page=never:

ioctl(3, KVM_CREATE_VM, 0) = 4 <0.000518>

ioctl(3, KVM_CREATE_VM, 0) = 4 <0.000495>

ioctl(3, KVM_CREATE_VM, 0) = 4 <0.000513>
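
For reference, a minimal stand-alone reproducer (just a sketch, assuming only
access to /dev/kvm and the linux/kvm.h UAPI header) that times the same ioctl
directly, without strace or kvmtool:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>
#include <linux/kvm.h>

int main(void)
{
        struct timespec t0, t1;
        int kvm, vm;

        kvm = open("/dev/kvm", O_RDWR);
        if (kvm < 0) {
                perror("open /dev/kvm");
                return 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        vm = ioctl(kvm, KVM_CREATE_VM, 0);      /* 0 = default VM type */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (vm < 0) {
                perror("KVM_CREATE_VM");
                return 1;
        }

        printf("KVM_CREATE_VM: %.6f seconds\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

        close(vm);
        close(kvm);
        return 0;
}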

Now, I had debugged the single-VM case before seeing this patch, and the latency
can be avoided by building the kernel with CONFIG_CGROUP_FAVOR_DYNMODS=y
or by mounting the cgroup v2 mount point with the favordynmods mount option.

This is because the high latency is coming from a call to
cgroup_attach_task_all() in:

kvm_vm_worker_thread()
  cgroup_attach_task_all()
    percpu_down_write(&cgroup_threadgroup_rwsem)
      /* calls synchronize_rcu() */

This happens while kvm_vm_create_worker_thread() is waiting on a completion.
See commit 6a010a49b63a for more information.

This patch is preferable because the favordynmods solution has a trade-off.
However, why don't we make nx_huge_pages=never the default behavior if the
CPU is not vulnerable? If there are concerns about not being able to restart
the worker thread, then maybe we could make this a .config option?

- Luiz


>
> Reported-by: Li RongQing <lirongqing@baidu.com>
> Closes: https://lore.kernel.org/all/1679555884-32544-1-git-send-email-lirongqing@baidu.com
> Cc: Yong He <zhuangel570@gmail.com>
> Cc: Robert Hoo <robert.hoo.linux@gmail.com>
> Cc: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 41 ++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 36 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8961f45e3b1..2ed38916b904 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -58,6 +58,8 @@
>
> extern bool itlb_multihit_kvm_mitigation;
>
> +static bool nx_hugepage_mitigation_hard_disabled;
> +
> int __read_mostly nx_huge_pages = -1;
> static uint __read_mostly nx_huge_pages_recovery_period_ms;
> #ifdef CONFIG_PREEMPT_RT
> @@ -67,12 +69,13 @@ static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
> static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
> #endif
>
> +static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp);
> static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
> static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp);
>
> static const struct kernel_param_ops nx_huge_pages_ops = {
> .set = set_nx_huge_pages,
> - .get = param_get_bool,
> + .get = get_nx_huge_pages,
> };
>
> static const struct kernel_param_ops nx_huge_pages_recovery_param_ops = {
> @@ -6844,6 +6847,14 @@ static void mmu_destroy_caches(void)
> kmem_cache_destroy(mmu_page_header_cache);
> }
>
> +static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
> +{
> + if (nx_hugepage_mitigation_hard_disabled)
> + return sprintf(buffer, "never\n");
> +
> + return param_get_bool(buffer, kp);
> +}
> +
> static bool get_nx_auto_mode(void)
> {
> /* Return true when CPU has the bug, and mitigations are ON */
> @@ -6860,15 +6871,29 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
> bool old_val = nx_huge_pages;
> bool new_val;
>
> + if (nx_hugepage_mitigation_hard_disabled)
> + return -EPERM;
> +
> /* In "auto" mode deploy workaround only if CPU has the bug. */
> - if (sysfs_streq(val, "off"))
> + if (sysfs_streq(val, "off")) {
> new_val = 0;
> - else if (sysfs_streq(val, "force"))
> + } else if (sysfs_streq(val, "force")) {
> new_val = 1;
> - else if (sysfs_streq(val, "auto"))
> + } else if (sysfs_streq(val, "auto")) {
> new_val = get_nx_auto_mode();
> - else if (kstrtobool(val, &new_val) < 0)
> + } else if (sysfs_streq(val, "never")) {
> + new_val = 0;
> +
> + mutex_lock(&kvm_lock);
> + if (!list_empty(&vm_list)) {
> + mutex_unlock(&kvm_lock);
> + return -EBUSY;
> + }
> + nx_hugepage_mitigation_hard_disabled = true;
> + mutex_unlock(&kvm_lock);
> + } else if (kstrtobool(val, &new_val) < 0) {
> return -EINVAL;
> + }
>
> __set_nx_huge_pages(new_val);
>
> @@ -7006,6 +7031,9 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
> uint old_period, new_period;
> int err;
>
> + if (nx_hugepage_mitigation_hard_disabled)
> + return -EPERM;
> +
> was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period);
>
> err = param_set_uint(val, kp);
> @@ -7161,6 +7189,9 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
> {
> int err;
>
> + if (nx_hugepage_mitigation_hard_disabled)
> + return 0;
> +
> err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
> "kvm-nx-lpage-recovery",
> &kvm->arch.nx_huge_page_recovery_thread);
>
> base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
> --
> 2.41.0.rc2.161.g9c6817b8e7-goog
>
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On Tue, Jun 06, 2023, Luiz Capitulino wrote:
> On Thu, Jun 01, 2023 at 05:58:59PM -0700, Sean Christopherson wrote:
> However, why don't we make nx_huge_pages=never the default behavior if the
> CPU is not vulnerable?

Mainly because the mitigation has been around for 3.5 years, and there's a non-zero
chance that making "never" the default could cause hiccups for unsuspecting users.
If this were brand new code, I would definitely opt for "never" as the default.

> If there are concerns about not being able to restart the worker thread, then
> maybe we could make this a .config option?

Eh, a Kconfig is unnecessarily complex, and wouldn't really change anything, e.g.
for users in the know, it's just as easy to force a module param as it is to force
a Kconfig, and to gain any benefit from the param being !never by default, the
Kconfig would also have to be off by default.

If "everyone" wants never to be the default, and Paolo doesn't object, I'd rather
just tack on a patch to make that happen, and cross my fingers there's no fallout :-)
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages [ In reply to ]
On Tue, Jun 06, 2023 at 03:03:38PM -0700, Sean Christopherson wrote:
>
>
>
> On Tue, Jun 06, 2023, Luiz Capitulino wrote:
> > On Thu, Jun 01, 2023 at 05:58:59PM -0700, Sean Christopherson wrote:
> > However, why don't we make nx_huge_pages=never the default behavior if the
> > CPU is not vulnerable?
>
> Mainly because the mitigation has been around for 3.5 years, and there's a non-zero
> chance that making "never" the default could cause hiccups for unsuspecting users.
> If this were brand new code, I would definitely opt for "never" as the default.

OK.

> > If there are concerns about not being able to restart the worker thread, then
> > maybe we could make this a .config option?
>
> Eh, a Kconfig is unnecessarily complex, and wouldn't really change anything, e.g.
> for users in the know, it's just as easy to force a module param as it is to force
> a Kconfig, and to gain any benefit from the param being !never by default, the
> Kconfig would also have to be off by default.

I agree it adds some complexity. The benefit is to allow shipping a kernel with
a good default where KVM users with non-vulnerable CPUs get low latency out
of the box (vs. having to figure out if they are vulnerable or not and changing
a run-time configuration).

But the idea would be to set "never" by default as long as the CPU is not vulnerable.

> If "everyone" wants never to be the default, and Paolo doesn't object, I'd rather
> just tack on a patch to make that happen, and cross my fingers there's no fallout :-)

This would work too :)
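
For concreteness, the default-to-"never" idea could look roughly like the
sketch below (purely illustrative, not an actual patch; the helper name is
made up, and it reuses boot_cpu_has_bug(X86_BUG_ITLB_MULTIHIT), which the
existing "auto" handling is already based on):

/* Hypothetical: latch the hard disable at module init on hosts that don't
 * have the iTLB multihit erratum, unless nx_huge_pages was explicitly set
 * on the command line (nx_huge_pages == -1 means "not set", i.e. auto).
 */
static void __init nx_huge_pages_init_default(void)
{
        if (nx_huge_pages == -1 && !boot_cpu_has_bug(X86_BUG_ITLB_MULTIHIT))
                nx_hugepage_mitigation_hard_disabled = true;
}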
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On Thu, 01 Jun 2023 17:58:59 -0700, Sean Christopherson wrote:
> Add a "never" option to the nx_huge_pages module param to allow userspace
> to do a one-way hard disabling of the mitigation, and don't create the
> per-VM recovery threads when the mitigation is hard disabled. Letting
> userspace pinky swear that userspace doesn't want to enable NX mitigation
> (without reloading KVM) allows certain use cases to avoid the latency
> problems associated with spawning a kthread for each VM.
>
> [...]

Applied to kvm-x86 mmu. I kept the default as "auto" for now, as that can go on
top and I don't want to introduce that change this late in the cycle. If no one
beats me to the punch (hint, hint ;-) ), I'll post a patch to make "never" the
default for unaffected hosts so that we can discuss/consider that change for 6.6.

Thanks for the reviews!

[1/1] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
https://github.com/kvm-x86/linux/commit/0b210faf3373

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On 2023-06-13 19:21, Sean Christopherson wrote:

>
>
>
> On Thu, 01 Jun 2023 17:58:59 -0700, Sean Christopherson wrote:
>> Add a "never" option to the nx_huge_pages module param to allow userspace
>> to do a one-way hard disabling of the mitigation, and don't create the
>> per-VM recovery threads when the mitigation is hard disabled. Letting
>> userspace pinky swear that userspace doesn't want to enable NX mitigation
>> (without reloading KVM) allows certain use cases to avoid the latency
>> problems associated with spawning a kthread for each VM.
>>
>> [...]
>
> Applied to kvm-x86 mmu. I kept the default as "auto" for now, as that can go on
> top and I don't want to introduce that change this late in the cycle. If no one
> beats me to the punch (hint, hint ;-) ), I'll post a patch to make "never" the
> default for unaffected hosts so that we can discuss/consider that change for 6.6.

Thanks Sean, I agree with the plan. I could give the patch a try if you'd like.

- Luiz

>
> Thanks for the reviews!
>
> [1/1] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
> https://github.com/kvm-x86/linux/commit/0b210faf3373
>
> --
> https://github.com/kvm-x86/linux/tree/next
> https://github.com/kvm-x86/linux/tree/fixes
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On Wed, Jun 14, 2023, Luiz Capitulino wrote:
>
>
> On 2023-06-13 19:21, Sean Christopherson wrote:
>
> >
> >
> >
> > On Thu, 01 Jun 2023 17:58:59 -0700, Sean Christopherson wrote:
> > > Add a "never" option to the nx_huge_pages module param to allow userspace
> > > to do a one-way hard disabling of the mitigation, and don't create the
> > > per-VM recovery threads when the mitigation is hard disabled. Letting
> > > userspace pinky swear that userspace doesn't want to enable NX mitigation
> > > (without reloading KVM) allows certain use cases to avoid the latency
> > > problems associated with spawning a kthread for each VM.
> > >
> > > [...]
> >
> > Applied to kvm-x86 mmu. I kept the default as "auto" for now, as that can go on
> > top and I don't want to introduce that change this late in the cycle. If no one
> > beats me to the punch (hint, hint ;-) ), I'll post a patch to make "never" the
> > default for unaffected hosts so that we can discuss/consider that change for 6.6.
>
> Thanks Sean, I agree with the plan. I could give a try on the patch if you'd like.

Yes please, thanks!
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On 2023/6/15 03:07, Sean Christopherson wrote:
> On Wed, Jun 14, 2023, Luiz Capitulino wrote:
>>
>>
>> On 2023-06-13 19:21, Sean Christopherson wrote:
>>
>>>
>>>
>>>
>>> On Thu, 01 Jun 2023 17:58:59 -0700, Sean Christopherson wrote:
>>>> Add a "never" option to the nx_huge_pages module param to allow userspace
>>>> to do a one-way hard disabling of the mitigation, and don't create the
>>>> per-VM recovery threads when the mitigation is hard disabled. Letting
>>>> userspace pinky swear that userspace doesn't want to enable NX mitigation
>>>> (without reloading KVM) allows certain use cases to avoid the latency
>>>> problems associated with spawning a kthread for each VM.
>>>>
>>>> [...]
>>>
>>> Applied to kvm-x86 mmu. I kept the default as "auto" for now, as that can go on
>>> top and I don't want to introduce that change this late in the cycle. If no one
>>> beats me to the punch (hint, hint ;-) ), I'll post a patch to make "never" the
>>> default for unaffected hosts so that we can discuss/consider that change for 6.6.
>>
>> Thanks Sean, I agree with the plan. I could give a try on the patch if you'd like.
>
> Yes please, thanks!
>
>

As a KVM/x86 *feature*, playing with splitting and reconstructing large
pages has other potential uses beyond the itlb_multihit mitigation, e.g. it
makes performance comparisons easier.

On unaffected machines (ICX and later), nx_huge_pages is already "N", and
turning it into "never" doesn't materially help the mitigation
implementation, but it does lose flexibility.

IMO, the real issue here is that the kernel thread "kvm-nx-lpage-
recovery" is created unconditionally. We also need to be aware of the
existence of this commit 084cc29f8bbb ("KVM: x86/MMU: Allow NX huge
pages to be disabled on a per-vm basis").

One technical proposal is to defer kvm_vm_create_worker_thread() to
kvm_mmu_create() or kvm_init_mmu(), based on kvm->arch.disable_nx_huge_pages,
or even until guest paging mode is enabled on the first vCPU.

Is this step worth taking?
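
For illustration, the deferral could look something like the sketch below
(hypothetical, not a real patch; the function name is made up, and it reuses
the identifiers from the patch above plus the per-VM flag added by commit
084cc29f8bbb):

/* Rough sketch: only spawn the recovery worker when it could actually be
 * needed, instead of unconditionally at VM creation.
 */
static int kvm_nx_huge_page_recovery_start(struct kvm *kvm)
{
        if (nx_hugepage_mitigation_hard_disabled ||
            kvm->arch.disable_nx_huge_pages)
                return 0;       /* no NX huge pages to recover, skip the kthread */

        return kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
                                           "kvm-nx-lpage-recovery",
                                           &kvm->arch.nx_huge_page_recovery_thread);
}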
Re: [PATCH] KVM: x86/mmu: Add "never" option to allow sticky disabling of nx_huge_pages
On Wed, Jul 12, 2023, Like Xu wrote:
> On 2023/6/15 03:07, Sean Christopherson wrote:
> > On Wed, Jun 14, 2023, Luiz Capitulino wrote:
> > > > Applied to kvm-x86 mmu. I kept the default as "auto" for now, as that can go on
> > > > top and I don't want to introduce that change this late in the cycle. If no one
> > > > beats me to the punch (hint, hint ;-) ), I'll post a patch to make "never" the
> > > > default for unaffected hosts so that we can discuss/consider that change for 6.6.
> > >
> > > Thanks Sean, I agree with the plan. I could give a try on the patch if you'd like.
> >
> > Yes please, thanks!
>
> As a KVM/x86 *feature*, playing with splitting and reconstructing large
> pages have other potential user scenarios, e.g. for performance test
> comparisons in a easier approach, not just for itlb_multihit mitigation.

Enabling and disabling dirty logging is a far better tool for that, as it gives
userspace much more explicit control over what pages are split/reconstituted,
and when.

> On unaffected machines (ICX and later), nx_huge_pages is already "N",
> and turning it into "never" doesn't help materially in the mitigation
> implementation, but loses flexibility.

I'm becoming more and more convinced that losing the flexibility is perfectly
acceptable. There's a very good argument to be made that mitigating DoS attacks
from the guest kernel should be done several levels up, e.g. by refusing to create
VMs for a customer that is bringing down hosts. As Jim has pointed out, plugging
the hole only works if you are 100% confident there are no other holes, and that
there never will be.

> IMO, the real issue here is that the kernel thread "kvm-nx-lpage-
> recovery" is created unconditionally. We also need to be aware of the
> existence of this commit 084cc29f8bbb ("KVM: x86/MMU: Allow NX huge
> pages to be disabled on a per-vm basis").
>
> One of the technical proposals is to defer kvm_vm_create_worker_thread()
> to kvm_mmu_create() or kvm_init_mmu(), based on
> kvm->arch.disable_nx_huge_pages, even until guest paging mode is enabled
> on the first vcpu.
>
> Is this step worth taking ?

IMO, no. In hindsight, adding KVM_CAP_VM_DISABLE_NX_HUGE_PAGES was likely a
mistake; requiring CAP_SYS_BOOT makes it annoyingly difficult to safely use the
capability. My preference at this point is to make changes to the NX hugepage
mitigation only when there is a substantial benefit to an already-deployed usecase.