
[PATCH 0/5] kvm: fix latent guest entry/exit bugs
Several architectures have latent bugs around guest entry/exit, most
notably:

1) Several architectures enable interrupts between guest_enter() and
guest_exit(). As this period is an RCU extended quiescent state (EQS) this
is unsound unless the irq entry code explicitly wakes RCU, which most
architectures only do for entry from userspace or idle.

I believe this affects: arm64, riscv, s390

I am not sure about powerpc.

2) Several architectures permit instrumentation of code between
guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
instrumentation may directly or indirectly use RCU, this has the same
problems as with interrupts.

I believe this affects: arm64, mips, powerpc, riscv, s390

3) Several architectures do not inform lockdep and tracing that
interrupts are enabled during the execution of the guest, or do so in
an incorrect order. Generally
this means that logs will report IRQs being masked for much longer
than is actually the case, which is not ideal for debugging. I don't
know whether this affects the correctness of lockdep.

I believe this affects: arm64, mips, powerpc, riscv, s390

This was previously fixed for x86 specifically in a series of commits:

87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")

But other architectures were left broken, and the infrastructure for
handling this correctly is x86-specific.

This series introduces generic helper functions which can be used to
handle the problems above, and migrates architectures over to these,
fixing the latent issues.
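
As a rough sketch of the intended usage (the helper names below are
taken from the first patch's title; the no-argument prototypes and the
run_vcpu_hw() call are assumptions for illustration only, not the
actual implementation), an architecture's vcpu run loop ends up shaped
something like:

local_irq_disable();

exit_to_guest_mode();        /* tell lockdep/tracing that IRQs will be
                              * enabled in the guest, account vtime,
                              * and enter the RCU EQS */

ret = run_vcpu_hw(vcpu);     /* hypothetical arch-specific, noinstr
                              * guest entry/exit */

enter_from_guest_mode();     /* leave the RCU EQS, account vtime, and
                              * restore lockdep/tracing state before
                              * running any instrumented code */

local_irq_enable();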

I wasn't able to figure my way around powerpc and s390, so I have not
altered these. I'd appreciate if anyone could take a look at those
cases, and either have a go at patches or provide some feedback as to
any alternative approaches which would work better there.

I have build-tested the arm64, mips, riscv, and x86 cases, but I don't
have a suitable HW setup to test these, so any review and/or testing
would be much appreciated.

I've pushed the series (based on v5.16) to my kvm/entry-rework branch:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework

... also tagged as kvm-entry-rework-20210111

Thanks,
Mark.

Mark Rutland (5):
kvm: add exit_to_guest_mode() and enter_from_guest_mode()
kvm/arm64: rework guest entry logic
kvm/mips: rework guest entry logic
kvm/riscv: rework guest entry logic
kvm/x86: rework guest entry logic

arch/arm64/kvm/arm.c | 51 +++++++++++-------
arch/mips/kvm/mips.c | 37 ++++++++++++--
arch/riscv/kvm/vcpu.c | 44 ++++++++++------
arch/x86/kvm/svm/svm.c | 4 +-
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/x86.c | 4 +-
arch/x86/kvm/x86.h | 45 ----------------
include/linux/kvm_host.h | 108 +++++++++++++++++++++++++++++++++++++--
8 files changed, 206 insertions(+), 91 deletions(-)

--
2.30.2
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On Tue, 11 Jan 2022 07:35:34 PST (-0800), mark.rutland@arm.com wrote:
> Several architectures have latent bugs around guest entry/exit, most
> notably:
>
> 1) Several architectures enable interrupts between guest_enter() and
> guest_exit(). As this period is an RCU extended quiescent state (EQS) this
> is unsound unless the irq entry code explicitly wakes RCU, which most
> architectures only do for entry from userspace or idle.
>
> I believe this affects: arm64, riscv, s390
>
> I am not sure about powerpc.
>
> 2) Several architectures permit instrumentation of code between
> guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
> instrumentation may directly or indirectly use RCU, this has the same
> problems as with interrupts.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390

Moving to Atish and Anup's new email addresses, looks like MAINTAINERS
hasn't been updated yet. I thought I remembered seeing patches getting
picked up for these, but LMK if you guys were expecting me to send them
along -- sorry if I misunderstood!

>
> 3) Several architectures do not inform lockdep and tracing that
> interrupts are enabled during the execution of the guest, or do so in
> an incorrect order. Generally
> this means that logs will report IRQs being masked for much longer
> than is actually the case, which is not ideal for debugging. I don't
> know whether this affects the correctness of lockdep.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390
>
> This was previously fixed for x86 specifically in a series of commits:
>
> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>
> But other architectures were left broken, and the infrastructure for
> handling this correctly is x86-specific.
>
> This series introduces generic helper functions which can be used to
> handle the problems above, and migrates architectures over to these,
> fixing the latent issues.
>
> I wasn't able to figure my way around powerpc and s390, so I have not
> altered these. I'd appreciate if anyone could take a look at those
> cases, and either have a go at patches or provide some feedback as to
> any alternative approaches which would work better there.
>
> I have build-tested the arm64, mips, riscv, and x86 cases, but I don't
> have a suitable HW setup to test these, so any review and/or testing
> would be much appreciated.
>
> I've pushed the series (based on v5.16) to my kvm/entry-rework branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework
> git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git kvm/entry-rework
>
> ... also tagged as kvm-entry-rework-20210111
>
> Thanks,
> Mark.
>
> Mark Rutland (5):
> kvm: add exit_to_guest_mode() and enter_from_guest_mode()
> kvm/arm64: rework guest entry logic
> kvm/mips: rework guest entry logic
> kvm/riscv: rework guest entry logic
> kvm/x86: rework guest entry logic
>
> arch/arm64/kvm/arm.c | 51 +++++++++++-------
> arch/mips/kvm/mips.c | 37 ++++++++++++--
> arch/riscv/kvm/vcpu.c | 44 ++++++++++------
> arch/x86/kvm/svm/svm.c | 4 +-
> arch/x86/kvm/vmx/vmx.c | 4 +-
> arch/x86/kvm/x86.c | 4 +-
> arch/x86/kvm/x86.h | 45 ----------------
> include/linux/kvm_host.h | 108 +++++++++++++++++++++++++++++++++++++--
> 8 files changed, 206 insertions(+), 91 deletions(-)
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On 11.01.22 at 16:35, Mark Rutland wrote:
> Several architectures have latent bugs around guest entry/exit, most
> notably:
>
> 1) Several architectures enable interrupts between guest_enter() and
> guest_exit(). As this period is an RCU extended quiescent state (EQS) this
> is unsound unless the irq entry code explicitly wakes RCU, which most
> architectures only do for entry from userspace or idle.
>
> I believe this affects: arm64, riscv, s390
>
> I am not sure about powerpc.
>
> 2) Several architectures permit instrumentation of code between
> guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
> instrumentation may directly or indirectly use RCU, this has the same
> problems as with interrupts.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390
>
> 3) Several architectures do not inform lockdep and tracing that
> interrupts are enabled during the execution of the guest, or do so in
> an incorrect order. Generally
> this means that logs will report IRQs being masked for much longer
> than is actually the case, which is not ideal for debugging. I don't
> know whether this affects the correctness of lockdep.
>
> I believe this affects: arm64, mips, powerpc, riscv, s390
>
> This was previously fixed for x86 specifically in a series of commits:
>
> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>
> But other architectures were left broken, and the infrastructure for
> handling this correctly is x86-specific.
>
> This series introduces generic helper functions which can be used to
> handle the problems above, and migrates architectures over to these,
> fixing the latent issues.
>
> I wasn't able to figure my way around powerpc and s390, so I have not

I think 2 later patches have moved the guest_enter/exit a bit out.
Does this make the s390 code clearer?

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 577f1ead6a51..5859207c2cc0 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4145,10 +4145,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
* As PF_VCPU will be used in fault handler, between
* guest_enter and guest_exit should be no uaccess.
*/
- local_irq_disable();
- guest_enter_irqoff();
- __disable_cpu_timer_accounting(vcpu);
- local_irq_enable();
if (kvm_s390_pv_cpu_is_protected(vcpu)) {
memcpy(sie_page->pv_grregs,
vcpu->run->s.regs.gprs,
@@ -4156,8 +4152,16 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
}
if (test_cpu_flag(CIF_FPU))
load_fpu_regs();
+ local_irq_disable();
+ __disable_cpu_timer_accounting(vcpu);
+ guest_enter_irqoff();
+ local_irq_enable();
exit_reason = sie64a(vcpu->arch.sie_block,
vcpu->run->s.regs.gprs);
+ local_irq_disable();
+ guest_exit_irqoff();
+ __enable_cpu_timer_accounting(vcpu);
+ local_irq_enable();
if (kvm_s390_pv_cpu_is_protected(vcpu)) {
memcpy(vcpu->run->s.regs.gprs,
sie_page->pv_grregs,
@@ -4173,10 +4177,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.sie_block->gpsw.mask &= ~PSW_INT_MASK;
}
}
- local_irq_disable();
- __enable_cpu_timer_accounting(vcpu);
- guest_exit_irqoff();
- local_irq_enable();
vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);

rc = vcpu_post_run(vcpu, exit_reason);
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
>
>
> On 11.01.22 at 16:35, Mark Rutland wrote:
> > Several architectures have latent bugs around guest entry/exit, most
> > notably:
> >
> > 1) Several architectures enable interrupts between guest_enter() and
> > guest_exit(). As this period is an RCU extended quiescent state (EQS) this
> > is unsound unless the irq entry code explicitly wakes RCU, which most
> > architectures only do for entry from userspace or idle.
> >
> > I believe this affects: arm64, riscv, s390
> >
> > I am not sure about powerpc.
> >
> > 2) Several architectures permit instrumentation of code between
> > guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
> > instrumentation may directly or indirectly use RCU, this has the same
> > problems as with interrupts.
> >
> > I believe this affects: arm64, mips, powerpc, riscv, s390
> >
> > 3) Several architectures do not inform lockdep and tracing that
> > interrupts are enabled during the execution of the guest, or do so in
> > an incorrect order. Generally
> > this means that logs will report IRQs being masked for much longer
> > than is actually the case, which is not ideal for debugging. I don't
> > know whether this affects the correctness of lockdep.
> >
> > I believe this affects: arm64, mips, powerpc, riscv, s390
> >
> > This was previously fixed for x86 specifically in a series of commits:
> >
> > 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> > 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> > 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> > 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> > 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> > 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> > bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
> >
> > But other architectures were left broken, and the infrastructure for
> > handling this correctly is x86-specific.
> >
> > This series introduces generic helper functions which can be used to
> > handle the problems above, and migrates architectures over to these,
> > fixing the latent issues.
> >
> > I wasn't able to figure my way around powerpc and s390, so I have not
>
> I think 2 later patches have moved the guest_enter/exit a bit out.
> Does this make the s390 code clearer?

Yes; that's much simpler to follow!

One major thing I wasn't sure about for s390 is the sequence:

guest_enter_irqoff(); // Enters an RCU EQS
...
local_irq_enable();
...
sie64a(...);
...
local_irq_disable();
...
guest_exit_irqoff(); // Exits an RCU EQS

... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
watching, and I couldn't spot whether your regular IRQ entry logic would wake
RCU in this case, or whether there was something else I'm missing that saves
you here.

For other architectures, including x86 and arm64, we enter the guest with IRQs
masked and return from the guest with IRQs masked, and don't actually take IRQs
until we unmask them in the host, after the guest_exit_*() logic has woken RCU
and so on.

I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
whether the local_irq_{enable,disable}() calls were necessary, or could be
removed.

Thanks,
Mark.

> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 577f1ead6a51..5859207c2cc0 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -4145,10 +4145,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
> * As PF_VCPU will be used in fault handler, between
> * guest_enter and guest_exit should be no uaccess.
> */
> - local_irq_disable();
> - guest_enter_irqoff();
> - __disable_cpu_timer_accounting(vcpu);
> - local_irq_enable();
> if (kvm_s390_pv_cpu_is_protected(vcpu)) {
> memcpy(sie_page->pv_grregs,
> vcpu->run->s.regs.gprs,
> @@ -4156,8 +4152,16 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
> }
> if (test_cpu_flag(CIF_FPU))
> load_fpu_regs();
> + local_irq_disable();
> + __disable_cpu_timer_accounting(vcpu);
> + guest_enter_irqoff();
> + local_irq_enable();
> exit_reason = sie64a(vcpu->arch.sie_block,
> vcpu->run->s.regs.gprs);
> + local_irq_disable();
> + guest_exit_irqoff();
> + __enable_cpu_timer_accounting(vcpu);
> + local_irq_enable();
> if (kvm_s390_pv_cpu_is_protected(vcpu)) {
> memcpy(vcpu->run->s.regs.gprs,
> sie_page->pv_grregs,
> @@ -4173,10 +4177,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
> vcpu->arch.sie_block->gpsw.mask &= ~PSW_INT_MASK;
> }
> }
> - local_irq_disable();
> - __enable_cpu_timer_accounting(vcpu);
> - guest_exit_irqoff();
> - local_irq_enable();
> vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> rc = vcpu_post_run(vcpu, exit_reason);
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On 14.01.22 at 13:19, Mark Rutland wrote:
> On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 11.01.22 at 16:35, Mark Rutland wrote:
>>> Several architectures have latent bugs around guest entry/exit, most
>>> notably:
>>>
>>> 1) Several architectures enable interrupts between guest_enter() and
>>> guest_exit(). As this period is an RCU extended quiescent state (EQS) this
>>> is unsound unless the irq entry code explicitly wakes RCU, which most
>>> architectures only do for entry from userspace or idle.
>>>
>>> I believe this affects: arm64, riscv, s390
>>>
>>> I am not sure about powerpc.
>>>
>>> 2) Several architectures permit instrumentation of code between
>>> guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
>>> instrumentation may directly or indirectly use RCU, this has the same
>>> problems as with interrupts.
>>>
>>> I believe this affects: arm64, mips, powerpc, riscv, s390
>>>
>>> 3) Several architectures do not inform lockdep and tracing that
>>> interrupts are enabled during the execution of the guest, or do so in
>>> an incorrect order. Generally
>>> this means that logs will report IRQs being masked for much longer
>>> than is actually the case, which is not ideal for debugging. I don't
>>> know whether this affects the correctness of lockdep.
>>>
>>> I believe this affects: arm64, mips, powerpc, riscv, s390
>>>
>>> This was previously fixed for x86 specifically in a series of commits:
>>>
>>> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
>>> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
>>> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
>>> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
>>> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
>>> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
>>> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>>>
>>> But other architectures were left broken, and the infrastructure for
>>> handling this correctly is x86-specific.
>>>
>>> This series introduces generic helper functions which can be used to
>>> handle the problems above, and migrates architectures over to these,
>>> fixing the latent issues.
>>>
>>> I wasn't able to figure my way around powerpc and s390, so I have not
>>
>> I think 2 later patches have moved the guest_enter/exit a bit out.
>> Does this make the s390 code clearer?
>
> Yes; that's much simpler to follow!
>
> One major thing I wasn't sure about for s390 is the sequence:
>
> guest_enter_irqoff(); // Enters an RCU EQS
> ...
> local_irq_enable();
> ...
> sie64a(...);
> ...
> local_irq_disable();
> ...
> guest_exit_irqoff(); // Exits an RCU EQS
>
> ... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
> watching, and I couldn't spot whether your regular IRQ entry logic would wake
> RCU in this case, or whether there was something else I'm missing that saves
> you here.
>
> For other architectures, including x86 and arm64, we enter the guest with IRQs
> masked and return from the guest with IRQs masked, and don't actually take IRQs
> until we unmask them in the host, after the guest_exit_*() logic has woken RCU
> and so on.
>
> I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
> whether the local_irq_{enable,disable}() calls were necessary, or could be
> removed.

We run the SIE instruction with interrupts enabled. SIE is interruptible.
The disable/enable pairs are just because guest_enter/exit_irqoff() require them.
One thing to be aware of: in our entry.S - after an interrupt - we leave SIE by
setting the return address of the interrupt after the sie instruction so that we
get back into this __vcpu_run loop to check for signals and so on.


>
> Thanks,
> Mark.
>
>> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
>> index 577f1ead6a51..5859207c2cc0 100644
>> --- a/arch/s390/kvm/kvm-s390.c
>> +++ b/arch/s390/kvm/kvm-s390.c
>> @@ -4145,10 +4145,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>> * As PF_VCPU will be used in fault handler, between
>> * guest_enter and guest_exit should be no uaccess.
>> */
>> - local_irq_disable();
>> - guest_enter_irqoff();
>> - __disable_cpu_timer_accounting(vcpu);
>> - local_irq_enable();
>> if (kvm_s390_pv_cpu_is_protected(vcpu)) {
>> memcpy(sie_page->pv_grregs,
>> vcpu->run->s.regs.gprs,
>> @@ -4156,8 +4152,16 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>> }
>> if (test_cpu_flag(CIF_FPU))
>> load_fpu_regs();
>> + local_irq_disable();
>> + __disable_cpu_timer_accounting(vcpu);
>> + guest_enter_irqoff();
>> + local_irq_enable();
>> exit_reason = sie64a(vcpu->arch.sie_block,
>> vcpu->run->s.regs.gprs);
>> + local_irq_disable();
>> + guest_exit_irqoff();
>> + __enable_cpu_timer_accounting(vcpu);
>> + local_irq_enable();
>> if (kvm_s390_pv_cpu_is_protected(vcpu)) {
>> memcpy(vcpu->run->s.regs.gprs,
>> sie_page->pv_grregs,
>> @@ -4173,10 +4177,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>> vcpu->arch.sie_block->gpsw.mask &= ~PSW_INT_MASK;
>> }
>> }
>> - local_irq_disable();
>> - __enable_cpu_timer_accounting(vcpu);
>> - guest_exit_irqoff();
>> - local_irq_enable();
>> vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
>> rc = vcpu_post_run(vcpu, exit_reason);
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On Fri, Jan 14, 2022 at 01:29:46PM +0100, Christian Borntraeger wrote:
>
>
> On 14.01.22 at 13:19, Mark Rutland wrote:
> > On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
> > > On 11.01.22 at 16:35, Mark Rutland wrote:
> > > > Several architectures have latent bugs around guest entry/exit, most
> > > > notably:
> > > >
> > > > 1) Several architectures enable interrupts between guest_enter() and
> > > > guest_exit(). As this period is an RCU extended quiescent state (EQS) this
> > > > is unsound unless the irq entry code explicitly wakes RCU, which most
> > > > architectures only do for entry from userspace or idle.
> > > >
> > > > I believe this affects: arm64, riscv, s390
> > > >
> > > > I am not sure about powerpc.
> > > >
> > > > 2) Several architectures permit instrumentation of code between
> > > > guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
> > > > instrumentation may directly or indirectly use RCU, this has the same
> > > > problems as with interrupts.
> > > >
> > > > I believe this affects: arm64, mips, powerpc, riscv, s390
> > > >
> > > > 3) Several architectures do not inform lockdep and tracing that
> > > > interrupts are enabled during the execution of the guest, or do so in
> > > > an incorrect order. Generally
> > > > this means that logs will report IRQs being masked for much longer
> > > > than is actually the case, which is not ideal for debugging. I don't
> > > > know whether this affects the correctness of lockdep.
> > > >
> > > > I believe this affects: arm64, mips, powerpc, riscv, s390
> > > >
> > > > This was previously fixed for x86 specifically in a series of commits:
> > > >
> > > > 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
> > > > 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
> > > > 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
> > > > 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
> > > > 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
> > > > 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
> > > > bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
> > > >
> > > > But other architectures were left broken, and the infrastructure for
> > > > handling this correctly is x86-specific.
> > > >
> > > > This series introduces generic helper functions which can be used to
> > > > handle the problems above, and migrates architectures over to these,
> > > > fixing the latent issues.
> > > >
> > > > I wasn't able to figure my way around powerpc and s390, so I have not
> > >
> > > I think 2 later patches have moved the guest_enter/exit a bit out.
> > > Does this make the s390 code clearer?
> >
> > Yes; that's much simpler to follow!
> >
> > One major thing I wasn't sure about for s390 is the sequence:
> >
> > guest_enter_irqoff(); // Enters an RCU EQS
> > ...
> > local_irq_enable();
> > ...
> > sie64a(...);
> > ...
> > local_irq_disable();
> > ...
> > guest_exit_irqoff(); // Exits an RCU EQS
> >
> > ... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
> > watching, and I couldn't spot whether your regular IRQ entry logic would wake
> > RCU in this case, or whether there was something else I'm missing that saves
> > you here.
> >
> > For other architectures, including x86 and arm64, we enter the guest with IRQs
> > masked and return from the guest with IRQs masked, and don't actually take IRQs
> > until we unmask them in the host, after the guest_exit_*() logic has woken RCU
> > and so on.
> >
> > I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
> > whether the local_irq_{enable,disable}() calls were necessary, or could be
> > removed.
>
> We run the SIE instruction with interrupts enabled. SIE is interruptible.
> The disable/enable pairs are just because guest_enter/exit_irqoff() require them.

What I was trying to figure out was when an interrupt is taken between
guest_enter_irqoff() and guest_exit_irqoff(), where is RCU woken? I couldn't
spot that in the s390 entry code (probably simply because I'm not familiar with
it), and so AFAICT that means IRQ code could run without RCU watching, which
would cause things to explode.

On other architectures that problem is avoided because IRQs asserted during the
guest cause a specific guest exit rather than a regular IRQ exception, and the
HW enables/disables IRQs when entering/exiting the guest, so the host can leave
IRQs masked across guest_enter_irqoff()..guest_exit_irqoff().

Am I right in understanding that SIE itself won't enable (host) interrupts
while running the guest, and so it *needs* to be run with interrupts already
enabled?

> One thing to be aware of: in our entry.S - after an interrupt - we leave SIE by
> setting the return address of the interrupt after the sie instruction so that we
> get back into this __vcpu_run loop to check for signals and so on.

Just to check, that's after the IRQ handler runs, right?

Thanks,
Mark.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On 14.01.22 at 14:32, Mark Rutland wrote:
> On Fri, Jan 14, 2022 at 01:29:46PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 14.01.22 at 13:19, Mark Rutland wrote:
>>> On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
>>>> On 11.01.22 at 16:35, Mark Rutland wrote:
>>>>> Several architectures have latent bugs around guest entry/exit, most
>>>>> notably:
>>>>>
>>>>> 1) Several architectures enable interrupts between guest_enter() and
>>>>> guest_exit(). As this period is an RCU extended quiescent state (EQS) this
>>>>> is unsound unless the irq entry code explicitly wakes RCU, which most
>>>>> architectures only do for entry from userspace or idle.
>>>>>
>>>>> I believe this affects: arm64, riscv, s390
>>>>>
>>>>> I am not sure about powerpc.
>>>>>
>>>>> 2) Several architectures permit instrumentation of code between
>>>>> guest_enter() and guest_exit(), e.g. KASAN, KCOV, KCSAN, etc. As
>>>>> instrumentation may directly or indirectly use RCU, this has the same
>>>>> problems as with interrupts.
>>>>>
>>>>> I believe this affects: arm64, mips, powerpc, riscv, s390
>>>>>
>>>>> 3) Several architectures do not inform lockdep and tracing that
>>>>> interrupts are enabled during the execution of the guest, or do so in
>>>>> an incorrect order. Generally
>>>>> this means that logs will report IRQs being masked for much longer
>>>>> than is actually the case, which is not ideal for debugging. I don't
>>>>> know whether this affects the correctness of lockdep.
>>>>>
>>>>> I believe this affects: arm64, mips, powerpc, riscv, s390
>>>>>
>>>>> This was previously fixed for x86 specifically in a series of commits:
>>>>>
>>>>> 87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
>>>>> 0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
>>>>> 9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
>>>>> 3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
>>>>> 135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
>>>>> 160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
>>>>> bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
>>>>>
>>>>> But other architectures were left broken, and the infrastructure for
>>>>> handling this correctly is x86-specific.
>>>>>
>>>>> This series introduces generic helper functions which can be used to
>>>>> handle the problems above, and migrates architectures over to these,
>>>>> fixing the latent issues.
>>>>>
>>>>> I wasn't able to figure my way around powerpc and s390, so I have not
>>>>
>>>> I think 2 later patches have moved the guest_enter/exit a bit out.
>>>> Does this make the s390 code clearer?
>>>
>>> Yes; that's much simpler to follow!
>>>
>>> One major thing I wasn't sure about for s390 is the sequence:
>>>
>>> guest_enter_irqoff(); // Enters an RCU EQS
>>> ...
>>> local_irq_enable();
>>> ...
>>> sie64a(...);
>>> ...
>>> local_irq_disable();
>>> ...
>>> guest_exit_irqoff(); // Exits an RCU EQS
>>>
>>> ... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
>>> watching, and I couldn't spot whether your regular IRQ entry logic would wake
>>> RCU in this case, or whether there was something else I'm missing that saves
>>> you here.
>>>
>>> For other architectures, including x86 and arm64, we enter the guest with IRQs
>>> masked and return from the guest with IRQs masked, and don't actually take IRQs
>>> until we unmask them in the host, after the guest_exit_*() logic has woken RCU
>>> and so on.
>>>
>>> I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
>>> whether the local_irq_{enable,disable}() calls were necessary, or could be
>>> removed.
>>
>> We run the SIE instruction with interrupts enabled. SIE is interruptible.
>> The disable/enable pairs are just because guest_enter/exit_irqoff() require them.
>
> What I was trying to figure out was when an interrupt is taken between
> guest_enter_irqoff() and guest_exit_irqoff(), where is RCU woken? I couldn't
> spot that in the s390 entry code (probably simply because I'm not familiar with
> it), and so AFAICT that means IRQ code could run without RCU watching, which
> would cause things to explode.
>
> On other architectures that problem is avoided because IRQs asserted during the
> guest cause a specific guest exit rather than a regular IRQ exception, and the
> HW enables/disables IRQs when entering/exiting the guest, so the host can leave
> IRQs masked across guest_enter_irqoff()..guest_exit_irqoff().
>
> Am I right in understanding that SIE itself won't enable (host) interrupts
> while running the guest, and so it *needs* to be run with interrupts already
> enabled?

yes

>
>> One thing to be aware of: in our entry.S - after an interrupt - we leave SIE by
>> setting the return address of the interrupt after the sie instruction so that we
>> get back into this __vcpu_run loop to check for signals and so on.
>
> Just to check, that's after the IRQ handler runs, right?

and yes.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On Fri, Jan 14, 2022 at 02:51:38PM +0100, Christian Borntraeger wrote:
> On 14.01.22 at 14:32, Mark Rutland wrote:
> > On Fri, Jan 14, 2022 at 01:29:46PM +0100, Christian Borntraeger wrote:
> > > On 14.01.22 at 13:19, Mark Rutland wrote:
> > > > On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
> > > > > On 11.01.22 at 16:35, Mark Rutland wrote:

[...]

> > > > One major thing I wasn't sure about for s390 is the sequence:
> > > >
> > > > guest_enter_irqoff(); // Enters an RCU EQS
> > > > ...
> > > > local_irq_enable();
> > > > ...
> > > > sie64a(...);
> > > > ...
> > > > local_irq_disable();
> > > > ...
> > > > guest_exit_irqoff(); // Exits an RCU EQS
> > > >
> > > > ... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
> > > > watching, and I couldn't spot whether your regular IRQ entry logic would wake
> > > > RCU in this case, or whether there was something else I'm missing that saves
> > > > you here.
> > > >
> > > > For other architectures, including x86 and arm64, we enter the guest with IRQs
> > > > masked and return from the guest with IRQs masked, and don't actually take IRQs
> > > > until we unmask them in the host, after the guest_exit_*() logic has woken RCU
> > > > and so on.
> > > >
> > > > I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
> > > > whether the local_irq_{enable,disable}() calls were necessary, or could be
> > > > removed.
> > >
> > > We run the SIE instruction with interrupts enabled. SIE is interruptible.
> > > The disable/enable pairs are just because guest_enter/exit_irqoff() require them.
> >
> > What I was trying to figure out was when an interrupt is taken between
> > guest_enter_irqoff() and guest_exit_irqoff(), where is RCU woken? I couldn't
> > spot that in the s390 entry code (probably simply because I'm not familiar with
> > it), and so AFAICT that means IRQ code could run without RCU watching, which
> > would cause things to explode.
> >
> > On other architectures that problem is avoided because IRQs asserted during the
> > guest cause a specific guest exit rather than a regular IRQ exception, and the
> > HW enables/disables IRQs when entering/exiting the guest, so the host can leave
> > IRQs masked across guest_enter_irqoff()..guest_exit_irqoff().
> >
> > Am I right in understanding that SIE itself won't enable (host) interrupts
> > while running the guest, and so it *needs* to be run with interrupts already
> > enabled?
>
> yes
>
> > > One thing to be aware of: in our entry.S - after an interrupt - we leave SIE by
> > > setting the return address of the interrupt after the sie instruction so that we
> > > get back into this __vcpu_run loop to check for signals and so on.
> >
> > Just to check, that's after the IRQ handler runs, right?
>
> and yes.

Thanks for confirming!

IIUC as above, that means there's a latent RCU bug on s390, and to fix that
we'll need to add something to the IRQ entry logic to wake RCU for any IRQ
taken in the EQS between guest_enter_irqoff() and guest_exit_irqoff(), similar
to what is done for IRQs taken from an idle EQS.

I see s390 uses the common irqentry_{enter,exit}(), so perhaps we could extend
the logic there to check something in addition to is_idle_task()? e.g. add a
noinstr helper to check kvm_running_vcpu, or add a thread flag that says we're
in this guest EQS.

I also think there is another issue here. When an IRQ is taken from SIE, will
user_mode(regs) always be false, or could it be true if the guest userspace is
running? If it can be true I think the context tracking checks can complain,
and it *might* be possible to trigger a panic().

In irqentry_enter(), if user_mode(regs) == true, we call
irqentry_enter_from_user_mode -> __enter_from_user_mode(). There we check that
the context is CONTEXT_USER, but IIUC that will be CONTEXT_GUEST at this point.
We also call arch_check_user_regs(), and IIUC this might permit a malicious
guest to trigger a host panic by way of debug_user_asce(), but I may have
misunderstood and that might not be possible.

Thanks,
Mark.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On 1/14/22 16:19, Mark Rutland wrote:
> I also think there is another issue here. When an IRQ is taken from SIE, will
> user_mode(regs) always be false, or could it be true if the guest userspace is
> running? If it can be true I think the context tracking checks can complain,
> and it *might* be possible to trigger a panic().

I think that it would be false, because the guest PSW is in the SIE
block and switched on SIE entry and exit, but I might be incorrect.

Paolo

> In irqentry_enter(), if user_mode(regs) == true, we call
> irqentry_enter_from_user_mode -> __enter_from_user_mode(). There we check that
> the context is CONTEXT_USER, but IIUC that will be CONTEXT_GUEST at this point.
> We also call arch_check_user_regs(), and IIUC this might permit a malicious
> guest to trigger a host panic by way of debug_user_asce(), but I may have
> misunderstood and that might not be possible.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
> On 1/14/22 16:19, Mark Rutland wrote:
> > I also think there is another issue here. When an IRQ is taken from SIE, will
> > user_mode(regs) always be false, or could it be true if the guest userspace is
> > running? If it can be true I think the context tracking checks can complain,
> > and it *might* be possible to trigger a panic().
>
> I think that it would be false, because the guest PSW is in the SIE block
> and switched on SIE entry and exit, but I might be incorrect.

Ah; that's the crux of my confusion: I had thought the guest PSW would
be placed in the regular lowcore *_old_psw slots. From looking at the
entry asm it looks like the host PSW (around the invocation of SIE) is
stored there, since that's what the OUTSIDE + SIEEXIT handling is
checking for.

Assuming that's correct, I agree this problem doesn't exist, and there's
only the common RCU/tracing/lockdep management to fix.

Sorry for the noise, and thanks for the pointer!

Thanks,
Mark.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On 18.01.22 at 13:02, Mark Rutland wrote:
> On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
>> On 1/14/22 16:19, Mark Rutland wrote:
>>> I also think there is another issue here. When an IRQ is taken from SIE, will
>>> user_mode(regs) always be false, or could it be true if the guest userspace is
>>> running? If it can be true I think the context tracking checks can complain,
>>> and it *might* be possible to trigger a panic().
>>
>> I think that it would be false, because the guest PSW is in the SIE block
>> and switched on SIE entry and exit, but I might be incorrect.
>
> Ah; that's the crux of my confusion: I had thought the guest PSW would
> be placed in the regular lowcore *_old_psw slots. From looking at the
> entry asm it looks like the host PSW (around the invocation of SIE) is
> stored there, since that's what the OUTSIDE + SIEEXIT handling is
> checking for.

Yes, Paolo's observation is correct.
>
> Assuming that's correct, I agree this problem doesn't exist, and there's
> only the common RCU/tracing/lockdep management to fix.
>
> Sorry for the noise, and thanks for the pointer!
>
> Thanks,
> Mark.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On 18.01.22 at 13:02, Mark Rutland wrote:
> On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
>> On 1/14/22 16:19, Mark Rutland wrote:
>>> I also think there is another issue here. When an IRQ is taken from SIE, will
>>> user_mode(regs) always be false, or could it be true if the guest userspace is
>>> running? If it can be true I think the context tracking checks can complain,
>>> and it *might* be possible to trigger a panic().
>>
>> I think that it would be false, because the guest PSW is in the SIE block
>> and switched on SIE entry and exit, but I might be incorrect.
>
> Ah; that's the crux of my confusion: I had thought the guest PSW would
> be placed in the regular lowcore *_old_psw slots. From looking at the
> entry asm it looks like the host PSW (around the invocation of SIE) is
> stored there, since that's what the OUTSIDE + SIEEXIT handling is
> checking for.
>
> Assuming that's correct, I agree this problem doesn't exist, and there's
> only the common RCU/tracing/lockdep management to fix.

Will you provide an s390 patch in your next iteration or shall we then do
one as soon as there is a v2? We also need to look into vsie.c where we
also call sie64a
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
>
>
> On 18.01.22 at 13:02, Mark Rutland wrote:
> > On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
> > > On 1/14/22 16:19, Mark Rutland wrote:
> > > > I also think there is another issue here. When an IRQ is taken from SIE, will
> > > > user_mode(regs) always be false, or could it be true if the guest userspace is
> > > > running? If it can be true I think the context tracking checks can complain,
> > > > and it *might* be possible to trigger a panic().
> > >
> > > I think that it would be false, because the guest PSW is in the SIE block
> > > and switched on SIE entry and exit, but I might be incorrect.
> >
> > Ah; that's the crux of my confusion: I had thought the guest PSW would
> > be placed in the regular lowcore *_old_psw slots. From looking at the
> > entry asm it looks like the host PSW (around the invocation of SIE) is
> > stored there, since that's what the OUTSIDE + SIEEXIT handling is
> > checking for.
> >
> > Assuming that's correct, I agree this problem doesn't exist, and there's
> > only the common RCU/tracing/lockdep management to fix.
>
> Will you provide an s390 patch in your next iteration or shall we then do
> one as soon as there is a v2? We also need to look into vsie.c where we
> also call sie64a

I'm having a go at that now; my plan is to try to have an s390 patch as
part of v2 in the next day or so.

Now that I have a rough idea of how SIE and exception handling works on
s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
vsie.c:do_vsie_run() are fairly simple.
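
For illustration, building on Christian's reordered diff earlier in the
thread, the window around sie64a() might end up shaped roughly as below
(a sketch only: the exit_to_guest_mode()/enter_from_guest_mode() names
come from the first patch and their placement here is an assumption,
not the actual v2 change):

/*
 * SIE runs with IRQs enabled, so the RCU EQS entered here covers an
 * IRQ-enabled window; hence the open question below about waking RCU
 * from the IRQ entry code.
 */
local_irq_disable();
__disable_cpu_timer_accounting(vcpu);
exit_to_guest_mode();        /* enter RCU EQS, update lockdep/tracing */
local_irq_enable();          /* SIE must be entered with IRQs enabled */

exit_reason = sie64a(vcpu->arch.sie_block, vcpu->run->s.regs.gprs);

local_irq_disable();
enter_from_guest_mode();     /* leave RCU EQS before instrumented code */
__enable_cpu_timer_accounting(vcpu);
local_irq_enable();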

The only open bit is exactly how/where to identify when the interrupt
entry code needs to wake RCU. I can add a per-cpu variable or thread
flag to indicate that we're inside that EQS, or or I could move the irq
enable/disable into the sie64a asm and identify that as with the OUTSIDE
macro in the entry asm.

Thanks,
Mark.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On 18.01.22 at 14:12, Mark Rutland wrote:
> On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 18.01.22 at 13:02, Mark Rutland wrote:
>>> On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
>>>> On 1/14/22 16:19, Mark Rutland wrote:
>>>>> I also think there is another issue here. When an IRQ is taken from SIE, will
>>>>> user_mode(regs) always be false, or could it be true if the guest userspace is
>>>>> running? If it can be true I think the context tracking checks can complain,
>>>>> and it *might* be possible to trigger a panic().
>>>>
>>>> I think that it would be false, because the guest PSW is in the SIE block
>>>> and switched on SIE entry and exit, but I might be incorrect.
>>>
>>> Ah; that's the crux of my confusion: I had thought the guest PSW would
>>> be placed in the regular lowcore *_old_psw slots. From looking at the
>>> entry asm it looks like the host PSW (around the invocation of SIE) is
>>> stored there, since that's what the OUTSIDE + SIEEXIT handling is
>>> checking for.
>>>
>>> Assuming that's correct, I agree this problem doesn't exist, and there's
>>> only the common RCU/tracing/lockdep management to fix.
>>
>> Will you provide an s390 patch in your next iteration or shall we then do
>> one as soon as there is a v2? We also need to look into vsie.c where we
>> also call sie64a
>
> I'm having a go at that now; my plan is to try to have an s390 patch as
> part of v2 in the next day or so.
>
> Now that I have a rough idea of how SIE and exception handling works on
> s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
> vsie.c:do_vsie_run() are fairly simple.
>
> The only open bit is exactly how/where to identify when the interrupt
> entry code needs to wake RCU. I can add a per-cpu variable or thread
> flag to indicate that we're inside that EQS, or I could move the irq
> enable/disable into the sie64a asm and identify that as with the OUTSIDE
> macro in the entry asm.
What exactly would the low-level interrupt handler need to do?

CC Sven, Heiko for the entry.S changes.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On Tue, Jan 18, 2022 at 03:15:51PM +0100, Christian Borntraeger wrote:
> On 18.01.22 at 14:12, Mark Rutland wrote:
> > On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
> > >
> > >
> > > On 18.01.22 at 13:02, Mark Rutland wrote:
> > > > On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
> > > > > On 1/14/22 16:19, Mark Rutland wrote:
> > > > > > I also think there is another issue here. When an IRQ is taken from SIE, will
> > > > > > user_mode(regs) always be false, or could it be true if the guest userspace is
> > > > > > running? If it can be true I think the context tracking checks can complain,
> > > > > > and it *might* be possible to trigger a panic().
> > > > >
> > > > > I think that it would be false, because the guest PSW is in the SIE block
> > > > > and switched on SIE entry and exit, but I might be incorrect.
> > > >
> > > > Ah; that's the crux of my confusion: I had thought the guest PSW would
> > > > be placed in the regular lowcore *_old_psw slots. From looking at the
> > > > entry asm it looks like the host PSW (around the invocation of SIE) is
> > > > stored there, since that's what the OUTSIDE + SIEEXIT handling is
> > > > checking for.
> > > >
> > > > Assuming that's correct, I agree this problem doesn't exist, and there's
> > > > only the common RCU/tracing/lockdep management to fix.
> > >
> > > Will you provide an s390 patch in your next iteration or shall we then do
> > > one as soon as there is a v2? We also need to look into vsie.c where we
> > > also call sie64a
> >
> > I'm having a go at that now; my plan is to try to have an s390 patch as
> > part of v2 in the next day or so.
> >
> > Now that I have a rough idea of how SIE and exception handling works on
> > s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
> > vsie.c:do_vsie_run() are fairly simple.
> >
> > The only open bit is exactly how/where to identify when the interrupt
> > entry code needs to wake RCU. I can add a per-cpu variable or thread
> > flag to indicate that we're inside that EQS, or I could move the irq
> > enable/disable into the sie64a asm and identify that as with the OUTSIDE
> > macro in the entry asm.
> What exactly would the low-level interrupt handler need to do?

Having looked around a bit, I think the best bet is to have
irqentry_enter() check PF_VCPU in addition to PF_IDLE (which it checks
via is_idle_task()), at which point nothing needs to change in the s390
entry code.

I'm currently implementing that, let me have a go, and then we can see
if that looks ok or whether we should do something else.

> CC Sven, Heiko for the entry.S changes.

I'll make sure you're all Cc'd when I send out v2 with s390 patches.

Thanks,
Mark.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
Hi Mark,

Mark Rutland <mark.rutland@arm.com> writes:

> On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
>>
>>
>> On 18.01.22 at 13:02, Mark Rutland wrote:
>> > On Mon, Jan 17, 2022 at 06:45:36PM +0100, Paolo Bonzini wrote:
>> > > On 1/14/22 16:19, Mark Rutland wrote:
>> > > > I also think there is another issue here. When an IRQ is taken from SIE, will
>> > > > user_mode(regs) always be false, or could it be true if the guest userspace is
>> > > > running? If it can be true I think the context tracking checks can complain,
>> > > > and it *might* be possible to trigger a panic().
>> > >
>> > > I think that it would be false, because the guest PSW is in the SIE block
>> > > and switched on SIE entry and exit, but I might be incorrect.
>> >
>> > Ah; that's the crux of my confusion: I had thought the guest PSW would
>> > be placed in the regular lowcore *_old_psw slots. From looking at the
>> > entry asm it looks like the host PSW (around the invocation of SIE) is
>> > stored there, since that's what the OUTSIDE + SIEEXIT handling is
>> > checking for.
>> >
>> > Assuming that's correct, I agree this problem doesn't exist, and there's
>> > only the common RCU/tracing/lockdep management to fix.
>>
>> Will you provide an s390 patch in your next iteration or shall we then do
>> one as soon as there is a v2? We also need to look into vsie.c where we
>> also call sie64a
>
> I'm having a go at that now; my plan is to try to have an s390 patch as
> part of v2 in the next day or so.
>
> Now that I have a rough idea of how SIE and exception handling works on
> s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
> vsie.c:do_vsie_run() are fairly simple.
>
> The only open bit is exactly how/where to identify when the interrupt
> entry code needs to wake RCU. I can add a per-cpu variable or thread
> flag to indicate that we're inside that EQS, or I could move the irq
> enable/disable into the sie64a asm and identify that as with the OUTSIDE
> macro in the entry asm.

I wonder whether the code in irqentry_enter() should call a function
is_eqs() instead of is_idle_task(). The default implementation would
be just a

#ifndef is_eqs
#define is_eqs is_idle_task
#endif

and if an architecture has special requirements, it could just define
is_eqs() and do the required checks there. This way the architecture
could define whether it's a percpu bit, a cpu flag or something else.

/Sven
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On Tue, Jan 18, 2022 at 05:09:25PM +0100, Sven Schnelle wrote:
> Hi Mark,

Hi Sven,

> Mark Rutland <mark.rutland@arm.com> writes:
> > On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
> >> Will you provide an s390 patch in your next iteration or shall we then do
> >> one as soon as there is a v2? We also need to look into vsie.c where we
> >> also call sie64a
> >
> > I'm having a go at that now; my plan is to try to have an s390 patch as
> > part of v2 in the next day or so.
> >
> > Now that I have a rough idea of how SIE and exception handling works on
> > s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
> > vsie.c:do_vsie_run() are fairly simple.
> >
> > The only open bit is exactly how/where to identify when the interrupt
> > entry code needs to wake RCU. I can add a per-cpu variable or thread
> > flag to indicate that we're inside that EQS, or or I could move the irq
> > enable/disable into the sie64a asm and identify that as with the OUTSIDE
> > macro in the entry asm.
>
> I wonder whether the code in irqentry_enter() should call a function
> is_eqs() instead of is_idle_task(). The default implementation would
> be just a
>
> #ifndef is_eqs
> #define is_eqs is_idle_task
> #endif
>
> and if an architecture has special requirements, it could just define
> is_eqs() and do the required checks there. This way the architecture
> could define whether it's a percpu bit, a cpu flag or something else.

I had come to almost the same approach: I've added an arch_in_rcu_eqs()
which is checked in addition to the existing is_idle_task() check.

In the case of checking is_idle_task() and checking for PF_VCPU, I'm
assuming the compiler can merge the loads of current->flags, and there's
little gain by making this entirely architecture specific, but we can
always check that and/or reconsider in future.
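
For illustration, the shape being discussed is roughly the following (a
sketch under the assumptions above, not the actual WIP code; the
combined irqentry_in_rcu_eqs() helper and the default arch_in_rcu_eqs()
are made up here for clarity):

/* Generic entry code, e.g. kernel/entry/common.c. */
#ifndef arch_in_rcu_eqs
static __always_inline bool arch_in_rcu_eqs(void)
{
        return false;        /* default: no extra arch-specific EQS */
}
#endif

static __always_inline bool irqentry_in_rcu_eqs(void)
{
        /*
         * An IRQ taken from the idle task, or from a task running a
         * guest with IRQs unmasked (PF_VCPU set, e.g. s390's SIE
         * window), arrives with RCU not watching and has to wake it.
         */
        return is_idle_task(current) ||
               (current->flags & PF_VCPU) ||
               arch_in_rcu_eqs();
}

irqentry_enter() would then use such a check in place of its existing
is_idle_task() test when deciding whether to wake RCU.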

Thanks,
Mark.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
On Tue, Jan 18, 2022 at 05:50:51PM +0000, Mark Rutland wrote:
> On Tue, Jan 18, 2022 at 05:09:25PM +0100, Sven Schnelle wrote:
> > I wonder whether the code in irqentry_enter() should call a function
> > is_eqs() instead of is_idle_task(). The default implementation would
> > be just a
> >
> > #ifndef is_eqs
> > #define is_eqs is_idle_task
> > #endif
> >
> > and if an architecture has special requirements, it could just define
> > is_eqs() and do the required checks there. This way the architecture
> > could define whether it's a percpu bit, a cpu flag or something else.
>
> I had come to almost the same approach: I've added an arch_in_rcu_eqs()
> which is checked in addition to the existing is_idle_task() check.
>
> In the case of checking is_idle_task() and checking for PF_VCPU, I'm
> assuming the compiler can merge the loads of current->flags, and there's
> little gain by making this entirely architecture specific, but we can
> always check that and/or reconsider in future.

FWIW, I've pushed out my WIP to:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kvm/entry-rework

... and I intend to clean that up and get it out on the list tomorrow.

The new entry/exit helpers are:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=kvm/entry-rework&id=df292ecabba50145849d8c8888cec9153267b31d

The arch_in_rcu_eqs() bit is:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=kvm/entry-rework&id=6e24c5ed7558ee7a4c95dfe62891dfdc51e6c6c4

The s390 changes are:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=kvm/entry-rework&id=ca8daba1809b6e4f1be425ca93f6373a2ea0af6b

I need to clean up the commit messages (including typos, TODOs, and
deleting some stale gunk), and there are some comments to write, but by
and large I think the structure is about right.

Thanks,
Mark.
Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs
Hi Mark,

Mark Rutland <mark.rutland@arm.com> writes:

> On Tue, Jan 18, 2022 at 05:09:25PM +0100, Sven Schnelle wrote:
>> Mark Rutland <mark.rutland@arm.com> writes:
>> > On Tue, Jan 18, 2022 at 01:42:26PM +0100, Christian Borntraeger wrote:
>> >> Will you provide an s390 patch in your next iteration or shall we then do
>> >> one as soon as there is a v2? We also need to look into vsie.c where we
>> >> also call sie64a
>> >
>> > I'm having a go at that now; my plan is to try to have an s390 patch as
>> > part of v2 in the next day or so.
>> >
>> > Now that I have a rough idea of how SIE and exception handling works on
>> > s390, I think the structural changes to kvm-s390.c:__vcpu_run() and
>> > vsie.c:do_vsie_run() are fairly simple.
>> >
>> > The only open bit is exactly how/where to identify when the interrupt
>> > entry code needs to wake RCU. I can add a per-cpu variable or thread
>> > flag to indicate that we're inside that EQS, or or I could move the irq
>> > enable/disable into the sie64a asm and identify that as with the OUTSIDE
>> > macro in the entry asm.
>>
>> I wonder whether the code in irqentry_enter() should call a function
>> is_eqs() instead of is_idle_task(). The default implementation would
>> be just a
>>
>> #ifndef is_eqs
>> #define is_eqs is_idle_task
>> #endif
>>
>> and if an architecture has special requirements, it could just define
>> is_eqs() and do the required checks there. This way the architecture
>> could define whether it's a percpu bit, a cpu flag or something else.
>
> I had come to almost the same approach: I've added an arch_in_rcu_eqs()
> which is checked in addition to the existing is_idle_task() check.
>

Sounds good, thanks!

/Sven