
[PATCH] KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
Don't force a masterclock update when a vCPU synchronizes to the current
TSC generation, e.g. when userspace hotplugs a pre-created vCPU into the
VM. Unnecessarily updating the masterclock is undesirable as it can cause
kvmclock's time to jump, which is particularly painful on systems with a
stable TSC as kvmclock _should_ be fully reliable on such systems.

The unexpected time jumps are due to differences in the TSC=>nanoseconds
conversion algorithms between kvmclock and the host's CLOCK_MONOTONIC_RAW
(the pvclock algorithm is inherently lossy). When updating the
masterclock, KVM refreshes the "base", i.e. moves the elapsed time since
the last update from the kvmclock/pvclock algorithm to the
CLOCK_MONOTONIC_RAW algorithm. Synchronizing kvmclock with
CLOCK_MONOTONIC_RAW is the lesser of evils when the TSC is unstable, but
adds no real value when the TSC is stable.
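
To see how lossy the pvclock math is in practice, consider a rough userspace
sketch. This is not KVM code: scale_delta() merely mirrors the shape of the
kernel's pvclock_scale_delta() (32.32 fixed-point scaling), and the 2.5 GHz
frequency is purely illustrative:

  #include <stdint.h>
  #include <stdio.h>

  /* Mirrors the kernel's pvclock_scale_delta(): 32.32 fixed-point scaling. */
  static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
  {
          if (shift < 0)
                  delta >>= -shift;
          else
                  delta <<= shift;
          return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
  }

  int main(void)
  {
          /* Illustrative 2.5 GHz TSC: 1 tick = 0.4 ns, i.e. 0.4 * 2^32. */
          uint32_t mul_frac = 1717986918;         /* floor(0.4 * 2^32) */
          uint64_t ticks = 2500000000ULL * 3600;  /* one hour of TSC ticks */

          printf("pvclock: %llu ns, exact: %llu ns\n",
                 (unsigned long long)scale_delta(ticks, mul_frac, 0),
                 (unsigned long long)(ticks * 2 / 5));
          return 0;
  }

Truncating the per-tick multiplier to 32 fractional bits costs hundreds of
nanoseconds per hour relative to exact arithmetic, and moving the accumulated
"base" between that algorithm and CLOCK_MONOTONIC_RAW is exactly what surfaces
as a jump.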

Prior to commit 7f187922ddf6 ("KVM: x86: update masterclock values on TSC
writes"), KVM did NOT force an update when synchronizing a vCPU to the
current generation.

commit 7f187922ddf6b67f2999a76dcb71663097b75497
Author: Marcelo Tosatti <mtosatti@redhat.com>
Date: Tue Nov 4 21:30:44 2014 -0200

KVM: x86: update masterclock values on TSC writes

When the guest writes to the TSC, the masterclock TSC copy must be
updated as well along with the TSC_OFFSET update, otherwise a negative
tsc_timestamp is calculated at kvm_guest_time_update.

Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
"if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
becomes redundant, so remove the do_request boolean and collapse
everything into a single condition.

Before that, KVM only re-synced the masterclock if the masterclock was
enabled or disabled. Note, at the time of the above commit, VMX
synchronized TSC on *guest* writes to MSR_IA32_TSC:

        case MSR_IA32_TSC:
                kvm_write_tsc(vcpu, msr_info);
                break;

which is why the changelog specifically says "guest writes", but the bug
that was being fixed wasn't unique to guest writes, i.e. a TSC write from
the host would suffer the same problem.

So even though KVM stopped synchronizing on guest writes as of commit
0c899c25d754 ("KVM: x86: do not attempt TSC synchronization on guest
writes"), simply reverting commit 7f187922ddf6 is not an option. Figuring
out how a negative tsc_timestamp could be computed requires a bit more
sleuthing.

In kvm_write_tsc() (at the time), except for KVM's "less than 1 second"
hack, KVM snapshotted the vCPU's current TSC *and* the current time in
nanoseconds, where kvm->arch.cur_tsc_nsec is the current host kernel time
in nanoseconds:

        ns = get_kernel_ns();

        ...

        if (usdiff < USEC_PER_SEC &&
            vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
                ...
        } else {
                /*
                 * We split periods of matched TSC writes into generations.
                 * For each generation, we track the original measured
                 * nanosecond time, offset, and write, so if TSCs are in
                 * sync, we can match exact offset, and if not, we can match
                 * exact software computation in compute_guest_tsc()
                 *
                 * These values are tracked in kvm->arch.cur_xxx variables.
                 */
                kvm->arch.cur_tsc_generation++;
                kvm->arch.cur_tsc_nsec = ns;
                kvm->arch.cur_tsc_write = data;
                kvm->arch.cur_tsc_offset = offset;
                matched = false;
                pr_debug("kvm: new tsc generation %llu, clock %llu\n",
                         kvm->arch.cur_tsc_generation, data);
        }

        ...

        /* Keep track of which generation this VCPU has synchronized to */
        vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
        vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
        vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;

Note that the above creates a new generation and sets "matched" to false!
But because kvm_track_tsc_matching() looks for matched+1, i.e. doesn't
require the vCPU that creates the new generation to match itself, KVM
would immediately compute vcpus_matched as true for VMs with a single vCPU.
As a result, KVM would skip the masterclock update, even though a new TSC
generation was created:

        vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
                         atomic_read(&vcpu->kvm->online_vcpus));

        if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
                if (!ka->use_master_clock)
                        do_request = 1;

        if (!vcpus_matched && ka->use_master_clock)
                        do_request = 1;

        if (do_request)
                kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
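
To make the hole concrete, here's that decision logic flattened into a
standalone sketch (plain C; wants_masterclock_update() is a made-up name and
KVM's structures are reduced to parameters):

  #include <stdbool.h>
  #include <stdio.h>

  /* The pre-7f187922ddf6 decision, with KVM's structures flattened out. */
  static bool wants_masterclock_update(int nr_vcpus_matched_tsc,
                                       int online_vcpus,
                                       bool host_clock_is_tsc,
                                       bool use_master_clock)
  {
          bool vcpus_matched = (nr_vcpus_matched_tsc + 1 == online_vcpus);

          if (vcpus_matched && host_clock_is_tsc && !use_master_clock)
                  return true;    /* all vCPUs match => enable masterclock */
          if (!vcpus_matched && use_master_clock)
                  return true;    /* a vCPU diverged => disable masterclock */
          return false;
  }

  int main(void)
  {
          /*
           * Single-vCPU VM, masterclock already on, and a TSC write just
           * started a new generation: no update is requested, so the
           * masterclock snapshot predates the new generation.
           */
          printf("%d\n", wants_masterclock_update(0, 1, true, true));
          return 0;
  }

With one online vCPU, nr_vcpus_matched_tsc + 1 always equals online_vcpus, so
once the masterclock is on, neither condition fires and no update is requested.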

On hardware without TSC scaling support, vcpu->tsc_catchup is set to true
if the guest TSC frequency is faster than the host TSC frequency, even if
the TSC is otherwise stable. And for that mode, kvm_guest_time_update(),
by way of compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the
kernel time at the last TSC write, to compute the guest TSC relative to
kernel time:

  static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
  {
        u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
                                      vcpu->arch.virtual_tsc_mult,
                                      vcpu->arch.virtual_tsc_shift);
        tsc += vcpu->arch.this_tsc_write;
        return tsc;
  }

Except the "kernel_ns" passed to compute_guest_tsc() isn't the current
kernel time, it's the masterclock snapshot!

        spin_lock(&ka->pvclock_gtod_sync_lock);
        use_master_clock = ka->use_master_clock;
        if (use_master_clock) {
                host_tsc = ka->master_cycle_now;
                kernel_ns = ka->master_kernel_ns;
        }
        spin_unlock(&ka->pvclock_gtod_sync_lock);

        if (vcpu->tsc_catchup) {
                u64 tsc = compute_guest_tsc(v, kernel_ns);
                if (tsc > tsc_timestamp) {
                        adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
                        tsc_timestamp = tsc;
                }
        }

And so when KVM skips the masterclock update after a TSC write, i.e. after
a new TSC generation is started, the "kernel_ns-vcpu->arch.this_tsc_nsec"
is *guaranteed* to generate a negative value, because this_tsc_nsec was
captured after ka->master_kernel_ns.
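
A standalone sketch of the fallout (again not KVM code; the helper repeats the
simplified 32.32 scaling from earlier and all values are made up) shows what
that negative delta does once it's consumed as an unsigned quantity:

  #include <stdint.h>
  #include <stdio.h>

  static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
  {
          if (shift < 0)
                  delta >>= -shift;
          else
                  delta <<= shift;
          return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
  }

  int main(void)
  {
          /* Made-up values: the masterclock snapshot is 1 ms older than the
           * nanosecond timestamp captured by the TSC write. */
          int64_t master_kernel_ns = 1000000000;
          int64_t this_tsc_nsec    = 1000000000 + 1000000;

          /* compute_guest_tsc() consumes the s64 delta as a u64... */
          uint64_t delta = (uint64_t)(master_kernel_ns - this_tsc_nsec);

          /* ...so -1000000 wraps to ~2^64 and the scaled "TSC" explodes. */
          printf("delta = %llu\nscaled = %llu\n",
                 (unsigned long long)delta,
                 (unsigned long long)scale_delta(delta, 1717986918, 0));
          return 0;
  }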

Forcing a masterclock update essentially fudged around that problem, but
in a heavy-handed way that introduced undesirable side effects, i.e. it
unnecessarily forces a masterclock update when a new vCPU joins the party
via hotplug.

Note, KVM forces masterclock updates in other weird ways that are also
likely unnecessary, e.g. when establishing a new Xen shared info page and
when userspace creates a brand new vCPU. But the Xen thing is firmly a
separate mess, and there are no known userspace VMMs that utilize kvmclock
*and* create new vCPUs after the VM is up and running. I.e. the other
issues are future problems.

Reported-by: Dongli Zhang <dongli.zhang@oracle.com>
Closes: https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com
Fixes: 7f187922ddf6 ("KVM: x86: update masterclock values on TSC writes")
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 530d4bc2259b..61bdb6c1d000 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2510,26 +2510,29 @@ static inline int gtod_is_based_on_tsc(int mode)
 }
 #endif
 
-static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 {
 #ifdef CONFIG_X86_64
-        bool vcpus_matched;
         struct kvm_arch *ka = &vcpu->kvm->arch;
         struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 
-        vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
-                         atomic_read(&vcpu->kvm->online_vcpus));
+        /*
+         * To use the masterclock, the host clocksource must be based on TSC
+         * and all vCPUs must have matching TSCs.  Note, the count for matching
+         * vCPUs doesn't include the reference vCPU, hence "+1".
+         */
+        bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
+                                 atomic_read(&vcpu->kvm->online_vcpus)) &&
+                                gtod_is_based_on_tsc(gtod->clock.vclock_mode);
 
         /*
-         * Once the masterclock is enabled, always perform request in
-         * order to update it.
-         *
-         * In order to enable masterclock, the host clocksource must be TSC
-         * and the vcpus need to have matched TSCs.  When that happens,
-         * perform request to enable masterclock.
+         * Request a masterclock update if the masterclock needs to be toggled
+         * on/off, or when starting a new generation and the masterclock is
+         * enabled (compute_guest_tsc() requires the masterclock snapshot to be
+         * taken _after_ the new generation is created).
          */
-        if (ka->use_master_clock ||
-            (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
+        if ((ka->use_master_clock && new_generation) ||
+            (ka->use_master_clock != use_master_clock))
                 kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
         trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -2706,7 +2709,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
         vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
         vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
 
-        kvm_track_tsc_matching(vcpu);
+        kvm_track_tsc_matching(vcpu, !matched);
 }
 
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)

base-commit: 437bba5ad2bba00c2056c896753a32edf80860cc
--
2.42.0.655.g421f12c284-goog
Re: [PATCH] KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
Tested-by: Dongli Zhang <dongli.zhang@oracle.com>


I did the test with the below KVM patch, which calculates the kvmclock on the
hypervisor side.

---
arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b0c47b4..9ddc437 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3068,6 +3068,11 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
         u64 tsc_timestamp, host_tsc;
         u8 pvclock_flags;
         bool use_master_clock;
+        struct pvclock_vcpu_time_info old_hv_clock;
+        u64 tsc, old_ns, new_ns, diff;
+        bool backward;
+
+        memcpy(&old_hv_clock, &vcpu->hv_clock, sizeof(old_hv_clock));
 
         kernel_ns = 0;
         host_tsc = 0;
@@ -3144,6 +3149,25 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
         vcpu->hv_clock.flags = pvclock_flags;
 
+        tsc = rdtsc();
+        tsc = kvm_read_l1_tsc(v, tsc);
+        old_ns = __pvclock_read_cycles(&old_hv_clock, tsc);
+        new_ns = __pvclock_read_cycles(&vcpu->hv_clock, tsc);
+        if (old_ns > new_ns) {
+                backward = true;
+                diff = old_ns - new_ns;
+        } else {
+                backward = false;
+                diff = new_ns - old_ns;
+        }
+        pr_alert("orabug: kvm_guest_time_update() vcpu=%d, tsc=%llu, backward=%d, diff=%llu, old_ns=%llu, new_ns=%llu\n"
+                 "old (%u, %llu, %llu, %u, %d, %u), new (%u, %llu, %llu, %u, %d, %u)",
+                 v->vcpu_id, tsc, backward, diff, old_ns, new_ns,
+                 old_hv_clock.version, old_hv_clock.tsc_timestamp, old_hv_clock.system_time,
+                 old_hv_clock.tsc_to_system_mul, old_hv_clock.tsc_shift, old_hv_clock.flags,
+                 vcpu->hv_clock.version, vcpu->hv_clock.tsc_timestamp, vcpu->hv_clock.system_time,
+                 vcpu->hv_clock.tsc_to_system_mul, vcpu->hv_clock.tsc_shift, vcpu->hv_clock.flags);
+
         if (vcpu->pv_time.active)
                 kvm_setup_guest_pvclock(v, &vcpu->pv_time, 0);
         if (vcpu->xen.vcpu_info_cache.active)
--

Dongli Zhang

On 10/18/23 12:56, Sean Christopherson wrote:
> [...]
Re: [PATCH] KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
Hi Sean,

Would you mind sharing whether the patch is waiting for a Reviewed-by, and when
it will be merged into the kvm-x86 tree?

While I'm not sure if the same developer can give both Tested-by and Reviewed-by ...

Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>


Thank you very much!

Dongli Zhang

On 10/20/23 00:45, Dongli Zhang wrote:
> [...]
Re: [PATCH] KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
On 14 November 2023 14:39:39 GMT-05:00, Sean Christopherson <seanjc@google.com> wrote:
> timing doesn't really matter in the end.

No pun intended?
Re: [PATCH] KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
On Tue, Nov 14, 2023, Dongli Zhang wrote:
> Hi Sean,
>
> Would you mind sharing whether the patch is waiting for a Reviewed-by, and when
> it will be merged into the kvm-x86 tree?

I'm at LPC this week, and out next week, so nothing is going to get applied to
kvm-x86 until after -rc3. I considered trying to squeeze in a few things this
week, but decided to just wait until -rc3 and not rush anything, as the timing
doesn't really matter in the end.

> While I'm not sure if the same developer can give both Tested-by and Reviewed-by ...
>
> Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>

Thanks! Providing both a Reviewed-by and Tested-by is totally valid.
Re: [PATCH] KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
On Wed, 18 Oct 2023 12:56:38 -0700, Sean Christopherson wrote:
> Don't force a masterclock update when a vCPU synchronizes to the current
> TSC generation, e.g. when userspace hotplugs a pre-created vCPU into the
> VM. Unnecessarily updating the masterclock is undesirable as it can cause
> kvmclock's time to jump, which is particularly painful on systems with a
> stable TSC as kvmclock _should_ be fully reliable on such systems.
>
> [...]

Applied to kvm-x86 misc, thanks!

[1/1] KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
https://github.com/kvm-x86/linux/commit/c52ffadc65e2

--
https://github.com/kvm-x86/linux/tree/next
Re: [PATCH] KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
On Wed, 2023-10-18 at 12:56 -0700, Sean Christopherson wrote:
> [...]
>
> And so when KVM skips the masterclock update after a TSC write, i.e. after
> a new TSC generation is started, the "kernel_ns-vcpu->arch.this_tsc_nsec"
> is *guaranteed* to generate a negative value, because this_tsc_nsec was
> captured after ka->master_kernel_ns.

So what? It *should* be negative, shouldn't it? I think the problem is
how we're using that value, and what we're conflating it with.

Let us consider the case where ka->use_master_clock is true, but we're
manually upscaling the TSC in software so vcpu->tsc_catchup is also
true.

Let us postpone, for the moment, the question of whether we should even
*let* use_master_clock become true in that case.

There are a number of points in time which need to be considered:

• vcpu->arch.this_tsc_nsec
• kvm->arch.master_kernel_ns
• The point in time "now" at which kvm_guest_time_update() is called.

For any given point in time, compute_guest_tsc() should calculate the
guest TSC at that moment, by scaling the elapsed microseconds since
vcpu->arch.this_tsc_nsec to the guest TSC frequency and adding that to
vcpu->arch.this_tsc_write.

I say "should", because compute_guest_tsc() is currently buggy when
asked to scale a *negative* number. Trivially fixable though.

Now, let's look at what kvm_guest_time_update() is doing. It attempts
to do two things. First it calculates the guest TSC at the reference
point that it's putting into the pvclock structure. That's what needs
to go into the 'tsc_timestamp' field of the pvclock structure alongside
the corresponding KVM clock 'system_time' at 'kernel_ns'. In master
clock mode, the value it uses for kernel_ns is ka->master_kernel_ns,
and otherwise it is the current time.

It's perfectly reasonable for master_kernel_ns to be earlier in time
than vcpu->this_tsc_nsec. That just means the TSC value we write to the
pvclock ends up being lower than the value in vcpu->this_tsc_write, by
an appropriate number of cycles. So as long as compute_guest_tsc()
isn't buggy with negative numbers, that should all be fine.

But there *is* a bug in kvm_guest_time_update(), I think...

In tsc_catchup mode, simulating a TSC which runs faster than the host,
the delta between host and guest TSCs gets larger and larger over
time. That's why kvm_guest_time_update() is called *every* time the
vCPU is entered, to adjust the TSC further and further every time.

But currently, kvm_guest_time_update() only nudges the guest TSC as far
forward as it should have been at master_kernel_ns. At any time later
than master_kernel_ns, the delta should be even higher.

I think compute_guest_tsc() should look something like this, to cope
with the negativity:

static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
{
        s64 delta = kernel_ns - vcpu->arch.this_tsc_nsec;
        u64 tsc = vcpu->arch.this_tsc_write;

        /* pvclock_scale_delta cannot cope with negative deltas */
        if (delta >= 0)
                tsc += pvclock_scale_delta(delta,
                                           vcpu->arch.virtual_tsc_mult,
                                           vcpu->arch.virtual_tsc_shift);
        else
                tsc -= pvclock_scale_delta(-delta,
                                           vcpu->arch.virtual_tsc_mult,
                                           vcpu->arch.virtual_tsc_shift);

        return tsc;
}

And the catchup code in kvm_guest_time_update() should correct *both*
the reference time *and* the current TSC by *different* amounts,
something like this:

if (vcpu->tsc_catchup) {
        uint64_t now_guest_tsc_adjusted;
        uint64_t now_guest_tsc_unadjusted;
        int64_t now_guest_tsc_delta;

        tsc_timestamp = compute_guest_tsc(v, kernel_ns);

        if (use_master_clock) {
                uint64_t now_host_tsc;
                int64_t now_kernel_ns;

                if (!kvm_get_time_and_clockread(&now_kernel_ns, &now_host_tsc)) {
                        now_kernel_ns = get_kvmclock_base_ns();
                        now_host_tsc = rdtsc();
                }
                now_guest_tsc_adjusted = compute_guest_tsc(v, now_kernel_ns);
                now_guest_tsc_unadjusted = kvm_read_l1_tsc(v, now_host_tsc);
        } else {
                now_guest_tsc_adjusted = tsc_timestamp;
                now_guest_tsc_unadjusted = kvm_read_l1_tsc(v, host_tsc);
        }

        now_guest_tsc_delta = now_guest_tsc_adjusted -
                              now_guest_tsc_unadjusted;

        if (now_guest_tsc_delta > 0)
                adjust_tsc_offset_guest(v, now_guest_tsc_delta);
} else {
        tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
}

Then we can drop that extra masterclock update in
kvm_track_tsc_matching(), along with the comment that
compute_guest_tsc() needs the masterclock snapshot to be newer.

> [...]