Mailing List Archive

Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 16.01.2012, at 07:40, Jeremy Fitzhardinge wrote:

> On Jan 16, 2012, at 2:57 PM, Alexander Graf wrote:
>
>>
>> On 14.01.2012, at 19:25, Raghavendra K T wrote:
>>
>>> The 5-patch series to follow this email extends KVM-hypervisor and Linux guest
>>> running on KVM-hypervisor to support pv-ticket spinlocks, based on Xen's implementation.
>>>
>>> One hypercall is introduced in the KVM hypervisor that allows a vcpu to kick
>>> another vcpu out of halt state.
>>> Blocking of the vcpu is done using halt() in the (lock_spinning) slowpath.
>>
>> Is the code for this even upstream? Prerequisite series seem to have been posted by Jeremy, but they didn't appear to have made it in yet.
>
> No, not yet. The patches are unchanged since I last posted them, and as far as I know there are no objections to them, but I'd like to get some performance numbers just to make sure they don't cause any surprising regressions, especially in the non-virtual case.

Yup, that's a very good idea :)

>
>>
>> Either way, thinking about this I stumbled over the following passage of his patch:
>>
>>> +	unsigned count = SPIN_THRESHOLD;
>>> +
>>> +	do {
>>> +		if (inc.head == inc.tail)
>>> +			goto out;
>>> +		cpu_relax();
>>> +		inc.head = ACCESS_ONCE(lock->tickets.head);
>>> +	} while (--count);
>>> +	__ticket_lock_spinning(lock, inc.tail);
>>
>>
>> That means we spin for n cycles, then notify the spinlock holder that we'd like to get kicked, and go to sleep. While I'm pretty sure that it improves the situation, it doesn't solve all of the issues we have.
>>
>> Imagine we have an idle host. All vcpus can freely run and everyone can fetch the lock as fast as on real machines. We don't need to / want to go to sleep here. Locks that take too long are bugs that need to be solved on real hw just as well, so all we do is possibly incur overhead.
>
> I'm not quite sure what your concern is. The lock is under contention, so there's nothing to do except spin; all this patch adds is a variable decrement/test to the spin loop, but that's not going to waste any more CPU than the non-counting case. And once it falls into the blocking path, it's a win because the VCPU isn't burning CPU any more.
>
>>
>> Imagine we have a contended host. Every vcpu gets at most 10% of a real CPU's runtime. So chances are 1:10 that you're actually running when you need to be. In such a setup, it's probably a good idea to be very pessimistic. Try to fetch the lock for 100 cycles and then immediately make room for all the other VMs that have real work going on!
>
> Are you saying the threshold should be dynamic depending on how loaded the system is? How can a guest know what the overall system contention is? How should a guest use that to work out a good spin time?

I'm saying what I'm saying in the next paragraph :). The guest doesn't know, but the host does. So if we had shared memory between guest and host, the host could put its threshold limit in there, which on an idle system could be -1 and on a contended system could be 1.
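To make the shared-memory idea concrete, here is a minimal sketch of what such an interface could look like. This is purely hypothetical: neither the structure nor the field exists in KVM today, and all names here are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical shape of the shared-memory idea: the host writes its
 * current spin limit into a page shared with the guest, and the guest
 * reads it before deciding how long to spin. On an idle host the
 * limit could be ~0u (i.e. -1, "spin forever"); on a contended host
 * it could be 1 ("give up almost immediately").
 */
struct pv_spin_shared {
    uint32_t spin_limit;    /* host-written spin budget for the guest */
};

static uint32_t guest_spin_limit(const struct pv_spin_shared *s)
{
    return s->spin_limit;
}
```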

> One possibility is to use the ticket lock queue depth to work out how contended the lock is, and therefore how long it might be worth waiting for. I was thinking of something along the lines of "threshold = (THRESHOLD >> queue_depth)". But that's pure hand wave, and someone would actually need to experiment before coming up with something reasonable.
>
> But all of this is good to consider for future work, rather than being essential for the first version.

Well, yes, of course! It's by no means an objection to what's there today. I'm just trying to think of ways to make it even better :)
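As a concrete rendering of Jeremy's hand-waved heuristic quoted above (threshold = THRESHOLD >> queue_depth), treating the ticket queue depth (tail - head) as a proxy for expected wait. The base value is illustrative, and as he says, nobody has benchmarked this:

```c
#include <assert.h>

#define SPIN_THRESHOLD (1 << 11)   /* illustrative base spin budget */

/*
 * Halve the spin budget for every waiter already queued on the
 * ticket lock: a deep queue means a long expected wait, so spinning
 * the full budget is unlikely to pay off. Depths >= 32 are clamped
 * to avoid undefined behavior in the shift.
 */
static unsigned spin_threshold(unsigned queue_depth)
{
    return queue_depth < 32 ? SPIN_THRESHOLD >> queue_depth : 0;
}
```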

>
>> So what I'm trying to get to is that if we had a hypervisor settable spin threshold, we could adjust it according to the host's load, getting VMs to behave differently on different (guest invisible) circumstances.
>>
>> Speaking of which - don't we have spin lock counters in the CPUs now? I thought we could set intercepts that notify us when the guest issues too many repz nops or whatever the typical spinlock identifier was. Can't we reuse that and just interrupt the guest if we see this with a special KVM interrupt that kicks off the internal spin lock waiting code? That way we don't slow down all those bare metal boxes.
>
> Yes, that mechanism exists, but it doesn't solve a very interesting problem.
>
> The most important thing to solve is making sure that when *releasing* a ticketlock, the correct next VCPU gets scheduled promptly. If you don't, you're just relying on the VCPU scheduler getting around to scheduling the correct VCPU, but if it doesn't it just ends up burning a timeslice of PCPU time while the wrong VCPU spins.
>
> Limiting the spin time with a timeout or the rep/nop interrupt somewhat mitigates this, but it still means you end up spending a lot of time slices spinning the wrong VCPU until it finally schedules the correct one. And the more contended the machine is, the worse the problem gets.

This is true as long as you're spinning. If, on overcommit, spinlocks would just yield() instead of spinning, we wouldn't have any vcpu running that's just waiting for a late ticket.

We still have an issue finding the point in time when a vcpu could run again, which is what this whole series is about. My point above was that instead of doing a count loop, we could just do the normal spin dance and set the threshold to the point where we enable the CPU's magic to notify us of another spinning lock. That way we

* don't change the uncontended case
* can set the threshold on the host, which knows how contended the system is

And since we control what spin locks look like, we can for example always keep the pointer to it in a specific register so that we can handle pv_lock_ops.lock_spinning() inside there and fetch all the information we need from our pt_regs.

>
>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal? Last time I checked, enabling all the PV ops did incur significant slowdown, which is why I went through the work to split the individual pv ops features up to only enable a few for KVM guests.
>
> The whole point of the pv-ticketlock work is to keep the pvops hooks out of the locking fast path, so that the calls are only made on the slow path - that is, when spinning too long on a contended lock, and when releasing a lock that's in a "slow" state. In the fast path case of no contention, there are no pvops, and the executed code path is almost identical to native.

You're still changing a tight loop that does nothing (CPU detects it and saves power) into something that performs calculations.

> But as I mentioned above, I'd like to see some benchmarks to prove that's the case.

Yes, that would be very good to have :)


Alex


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 14.01.2012, at 19:25, Raghavendra K T wrote:

> The 5-patch series to follow this email extends KVM-hypervisor and Linux guest
> running on KVM-hypervisor to support pv-ticket spinlocks, based on Xen's implementation.
>
> One hypercall is introduced in the KVM hypervisor that allows a vcpu to kick
> another vcpu out of halt state.
> Blocking of the vcpu is done using halt() in the (lock_spinning) slowpath.

Is the code for this even upstream? Prerequisite series seem to have been posted by Jeremy, but they didn't appear to have made it in yet.

Either way, thinking about this I stumbled over the following passage of his patch:

> +	unsigned count = SPIN_THRESHOLD;
> +
> +	do {
> +		if (inc.head == inc.tail)
> +			goto out;
> +		cpu_relax();
> +		inc.head = ACCESS_ONCE(lock->tickets.head);
> +	} while (--count);
> +	__ticket_lock_spinning(lock, inc.tail);


That means we spin for n cycles, then notify the spinlock holder that we'd like to get kicked, and go to sleep. While I'm pretty sure that it improves the situation, it doesn't solve all of the issues we have.
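The behavior of that slowpath can be sketched as a single-threaded simulation. This is not the kernel code: the `release_after` parameter here stands in for the lock holder releasing the lock while we spin, and the threshold value is illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define SPIN_THRESHOLD (1 << 11)   /* illustrative spin budget */

/*
 * Simulation of the quoted slowpath: spin up to SPIN_THRESHOLD
 * iterations waiting for head to reach our ticket. If the holder
 * "releases" (head advances) within the budget, we win by spinning;
 * otherwise we fall through to where the guest would call
 * __ticket_lock_spinning() and halt.
 */
static bool spin_or_block(unsigned my_ticket, unsigned head,
                          unsigned release_after)
{
    unsigned count = SPIN_THRESHOLD;
    unsigned iter = 0;

    do {
        if (head == my_ticket)
            return true;             /* got the lock while spinning */
        if (++iter == release_after)
            head = my_ticket;        /* holder released the lock */
    } while (--count);

    return false;                    /* would notify the holder and halt */
}
```

An uncontended lock (head already equal to our ticket) returns immediately; only a hold longer than the budget pays for the blocking path.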

Imagine we have an idle host. All vcpus can freely run and everyone can fetch the lock as fast as on real machines. We don't need to / want to go to sleep here. Locks that take too long are bugs that need to be solved on real hw just as well, so all we do is possibly incur overhead.

Imagine we have a contended host. Every vcpu gets at most 10% of a real CPU's runtime. So chances are 1:10 that you're actually running when you need to be. In such a setup, it's probably a good idea to be very pessimistic. Try to fetch the lock for 100 cycles and then immediately make room for all the other VMs that have real work going on!

So what I'm trying to get to is that if we had a hypervisor settable spin threshold, we could adjust it according to the host's load, getting VMs to behave differently on different (guest invisible) circumstances.

Speaking of which - don't we have spin lock counters in the CPUs now? I thought we could set intercepts that notify us when the guest issues too many repz nops or whatever the typical spinlock identifier was. Can't we reuse that and just interrupt the guest if we see this with a special KVM interrupt that kicks off the internal spin lock waiting code? That way we don't slow down all those bare metal boxes.

Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal? Last time I checked, enabling all the PV ops did incur significant slowdown, which is why I went through the work to split the individual pv ops features up to only enable a few for KVM guests.

>
> Changes in V4:
> - rebased to 3.2.0 pre.
> - use APIC ID for kicking the vcpu and use kvm_apic_match_dest for matching. (Avi)
> - fold vcpu->kicked flag into vcpu->requests (KVM_REQ_PVLOCK_KICK) and related
> changes for UNHALT path to make pv ticket spinlock migration friendly. (Avi, Marcello)
> - Added documentation for CPUID, hypercall (KVM_HC_KICK_CPU)
> and capability (KVM_CAP_PVLOCK_KICK). (Avi)
> - Remove unneeded kvm_arch_vcpu_ioctl_set_mpstate call. (Marcello)
> - cumulative variable type changed (int ==> u32) in add_stat. (Konrad)
> - remove unneeded kvm_guest_init for the !CONFIG_KVM_GUEST case
>
> Changes in V3:
> - rebased to 3.2-rc1
> - use halt() instead of wait-for-kick hypercall.
> - modify kick hypercall to wake up the halted vcpu.
> - hook kvm_spinlock_init to the smp_prepare_cpus call (moved the call out of head##.c).
> - fix the potential race when zero_stat is read.
> - export debugfs_create_32 and add documentation to the API.
> - use static inline and enum instead of the ADDSTAT macro.
> - add barrier() after setting kick_vcpu.
> - empty static inline function for kvm_spinlock_init.
> - combine patches one and two to reduce overhead.
> - make KVM_DEBUGFS depend on DEBUGFS.
> - include the debugfs header unconditionally.
>
> Changes in V2:
> - rebased patches to -rc9
> - synchronization-related changes based on Jeremy's changes
> (Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>), pointed out by
> Stephan Diestelhorst <stephan.diestelhorst@amd.com>
> - enabling 32-bit guests
> - split patches into two more chunks
>
> Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T (5):
> Add debugfs support to print u32-arrays in debugfs
> Add a hypercall to KVM hypervisor to support pv-ticketlocks
> Added configuration support to enable debug information for KVM Guests
> pv-ticketlocks support for linux guests running on KVM hypervisor
> Add documentation on Hypercalls and features used for PV spinlock
>
> Test setup:
> The BASE patch is pre 3.2.0 + Jeremy's following patches:
> xadd (https://lkml.org/lkml/2011/10/4/328)
> x86/ticketlocklock (https://lkml.org/lkml/2011/10/12/496).
> Kernel for host/guest: 3.2.0 + Jeremy's xadd, pv spinlock patches as BASE
> (Note: locked add change is not taken yet)
>
> Results:
> The performance gain is mainly because of reduced busy-wait time.
> From the results we can see that the patched kernel's performance is similar to
> BASE when there is no lock contention. But once we start seeing more
> contention, the patched kernel outperforms BASE (non-PLE).
> On the PLE machine we do not see as great a performance improvement because PLE
> complements halt().
>
> 3 guests with 8 VCPUs, 4GB RAM; 1 used for kernbench
> (kernbench -f -H -M -o 20), the others for cpuhogs (shell script:
> while true with an instruction)
>
> scenario A: unpinned
>
> 1x: no hogs
> 2x: 8 hogs in one guest
> 3x: 8 hogs each in two guests
>
> scenario B: unpinned, run kernbench on all the guests, no hogs.
>
> Dbench on PLE machine:
> dbench run on all the guests simultaneously with
> dbench --warmup=30 -t 120 with NRCLIENTS=(8/16/32).
>
> Result for Non PLE machine :
> ============================
> Machine : IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8 cores, 64GB RAM
> BASE BASE+patch %improvement
> mean (sd) mean (sd)
> Scenario A:
> case 1x: 164.233 (16.5506) 163.584 (15.4598) 0.39517
> case 2x: 897.654 (543.993) 328.63 (103.771) 63.3901
> case 3x: 2855.73 (2201.41) 315.029 (111.854) 88.9685
>
> Dbench:
> Throughput is in MB/sec
> NRCLIENTS BASE BASE+patch %improvement
> mean (sd) mean (sd)
> 8 1.774307 (0.061361) 1.725667 (0.034644) -2.74135
> 16 1.445967 (0.044805) 1.463173 (0.094399) 1.18993
> 32 2.136667 (0.105717) 2.193792 (0.129357) 2.67356
>
> Result for PLE machine:
> ======================
> Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32/64 cores, with 8
> online cores and 4*64GB RAM
>
> Kernbench:
> BASE BASE+patch %improvement
> mean (sd) mean (sd)
> Scenario A:
> case 1x: 161.263 (56.518) 159.635 (40.5621) 1.00953
> case 2x: 190.748 (61.2745) 190.606 (54.4766) 0.0744438
> case 3x: 227.378 (100.215) 225.442 (92.0809) 0.851446
>
> Scenario B:
> 446.104 (58.54 ) 433.12733 (54.476) 2.91
>
> Dbench:
> Throughput is in MB/sec
> NRCLIENTS BASE BASE+patch %improvement
> mean (sd) mean (sd)
> 8 1.101190 (0.875082) 1.700395 (0.846809) 54.4143
> 16 1.524312 (0.120354) 1.477553 (0.058166) -3.06755
> 32 2.143028 (0.157103) 2.090307 (0.136778) -2.46012

So on a very contended system we're actually slower? Is this expected?
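For reference, the %improvement columns in the quoted tables can be reproduced with the following arithmetic; kernbench reports elapsed time (lower is better) while dbench reports throughput (higher is better), which is why the 16- and 32-client dbench rows come out negative:

```c
#include <assert.h>

/* %improvement for a lower-is-better metric (kernbench elapsed time) */
static double improvement_time(double base, double patched)
{
    return (base - patched) / base * 100.0;
}

/* %improvement for a higher-is-better metric (dbench throughput) */
static double improvement_throughput(double base, double patched)
{
    return (patched - base) / base * 100.0;
}
```

For example, the non-PLE kernbench 2x case, 897.654 to 328.63, gives roughly 63.39%, and the PLE 32-client dbench case, 2.143028 to 2.090307, gives roughly -2.46%, matching the tables.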


Alex


Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On Jan 16, 2012, at 2:57 PM, Alexander Graf wrote:

>
> On 14.01.2012, at 19:25, Raghavendra K T wrote:
>
>> The 5-patch series to follow this email extends KVM-hypervisor and Linux guest
>> running on KVM-hypervisor to support pv-ticket spinlocks, based on Xen's implementation.
>>
>> One hypercall is introduced in the KVM hypervisor that allows a vcpu to kick
>> another vcpu out of halt state.
>> Blocking of the vcpu is done using halt() in the (lock_spinning) slowpath.
>
> Is the code for this even upstream? Prerequisite series seem to have been posted by Jeremy, but they didn't appear to have made it in yet.

No, not yet. The patches are unchanged since I last posted them, and as far as I know there are no objections to them, but I'd like to get some performance numbers just to make sure they don't cause any surprising regressions, especially in the non-virtual case.

>
> Either way, thinking about this I stumbled over the following passage of his patch:
>
>> +	unsigned count = SPIN_THRESHOLD;
>> +
>> +	do {
>> +		if (inc.head == inc.tail)
>> +			goto out;
>> +		cpu_relax();
>> +		inc.head = ACCESS_ONCE(lock->tickets.head);
>> +	} while (--count);
>> +	__ticket_lock_spinning(lock, inc.tail);
>
>
> That means we spin for n cycles, then notify the spinlock holder that we'd like to get kicked, and go to sleep. While I'm pretty sure that it improves the situation, it doesn't solve all of the issues we have.
>
> Imagine we have an idle host. All vcpus can freely run and everyone can fetch the lock as fast as on real machines. We don't need to / want to go to sleep here. Locks that take too long are bugs that need to be solved on real hw just as well, so all we do is possibly incur overhead.

I'm not quite sure what your concern is. The lock is under contention, so there's nothing to do except spin; all this patch adds is a variable decrement/test to the spin loop, but that's not going to waste any more CPU than the non-counting case. And once it falls into the blocking path, it's a win because the VCPU isn't burning CPU any more.

>
> Imagine we have a contended host. Every vcpu gets at most 10% of a real CPU's runtime. So chances are 1:10 that you're actually running when you need to be. In such a setup, it's probably a good idea to be very pessimistic. Try to fetch the lock for 100 cycles and then immediately make room for all the other VMs that have real work going on!

Are you saying the threshold should be dynamic depending on how loaded the system is? How can a guest know what the overall system contention is? How should a guest use that to work out a good spin time?

One possibility is to use the ticket lock queue depth to work out how contended the lock is, and therefore how long it might be worth waiting for. I was thinking of something along the lines of "threshold = (THRESHOLD >> queue_depth)". But that's pure hand wave, and someone would actually need to experiment before coming up with something reasonable.

But all of this is good to consider for future work, rather than being essential for the first version.

> So what I'm trying to get to is that if we had a hypervisor settable spin threshold, we could adjust it according to the host's load, getting VMs to behave differently on different (guest invisible) circumstances.
>
> Speaking of which - don't we have spin lock counters in the CPUs now? I thought we could set intercepts that notify us when the guest issues too many repz nops or whatever the typical spinlock identifier was. Can't we reuse that and just interrupt the guest if we see this with a special KVM interrupt that kicks off the internal spin lock waiting code? That way we don't slow down all those bare metal boxes.

Yes, that mechanism exists, but it doesn't solve a very interesting problem.

The most important thing to solve is making sure that when *releasing* a ticketlock, the correct next VCPU gets scheduled promptly. If you don't, you're just relying on the VCPU scheduler getting around to scheduling the correct VCPU, but if it doesn't it just ends up burning a timeslice of PCPU time while the wrong VCPU spins.

Limiting the spin time with a timeout or the rep/nop interrupt somewhat mitigates this, but it still means you end up spending a lot of time slices spinning the wrong VCPU until it finally schedules the correct one. And the more contended the machine is, the worse the problem gets.

> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal? Last time I checked, enabling all the PV ops did incur significant slowdown, which is why I went through the work to split the individual pv ops features up to only enable a few for KVM guests.

The whole point of the pv-ticketlock work is to keep the pvops hooks out of the locking fast path, so that the calls are only made on the slow path - that is, when spinning too long on a contended lock, and when releasing a lock that's in a "slow" state. In the fast path case of no contention, there are no pvops, and the executed code path is almost identical to native.

But as I mentioned above, I'd like to see some benchmarks to prove that's the case.

J

>
>>
>> Changes in V4:
>> - rebased to 3.2.0 pre.
>> - use APIC ID for kicking the vcpu and use kvm_apic_match_dest for matching. (Avi)
>> - fold vcpu->kicked flag into vcpu->requests (KVM_REQ_PVLOCK_KICK) and related
>> changes for UNHALT path to make pv ticket spinlock migration friendly. (Avi, Marcello)
>> - Added documentation for CPUID, hypercall (KVM_HC_KICK_CPU)
>> and capability (KVM_CAP_PVLOCK_KICK). (Avi)
>> - Remove unneeded kvm_arch_vcpu_ioctl_set_mpstate call. (Marcello)
>> - cumulative variable type changed (int ==> u32) in add_stat. (Konrad)
>> - remove unneeded kvm_guest_init for the !CONFIG_KVM_GUEST case
>>
>> Changes in V3:
>> - rebased to 3.2-rc1
>> - use halt() instead of wait-for-kick hypercall.
>> - modify kick hypercall to wake up the halted vcpu.
>> - hook kvm_spinlock_init to the smp_prepare_cpus call (moved the call out of head##.c).
>> - fix the potential race when zero_stat is read.
>> - export debugfs_create_32 and add documentation to the API.
>> - use static inline and enum instead of the ADDSTAT macro.
>> - add barrier() after setting kick_vcpu.
>> - empty static inline function for kvm_spinlock_init.
>> - combine patches one and two to reduce overhead.
>> - make KVM_DEBUGFS depend on DEBUGFS.
>> - include the debugfs header unconditionally.
>>
>> Changes in V2:
>> - rebased patches to -rc9
>> - synchronization-related changes based on Jeremy's changes
>> (Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>), pointed out by
>> Stephan Diestelhorst <stephan.diestelhorst@amd.com>
>> - enabling 32-bit guests
>> - split patches into two more chunks
>>
>> Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T (5):
>> Add debugfs support to print u32-arrays in debugfs
>> Add a hypercall to KVM hypervisor to support pv-ticketlocks
>> Added configuration support to enable debug information for KVM Guests
>> pv-ticketlocks support for linux guests running on KVM hypervisor
>> Add documentation on Hypercalls and features used for PV spinlock
>>
>> Test setup:
>> The BASE patch is pre 3.2.0 + Jeremy's following patches:
>> xadd (https://lkml.org/lkml/2011/10/4/328)
>> x86/ticketlocklock (https://lkml.org/lkml/2011/10/12/496).
>> Kernel for host/guest: 3.2.0 + Jeremy's xadd, pv spinlock patches as BASE
>> (Note: locked add change is not taken yet)
>>
>> Results:
>> The performance gain is mainly because of reduced busy-wait time.
>> From the results we can see that the patched kernel's performance is similar to
>> BASE when there is no lock contention. But once we start seeing more
>> contention, the patched kernel outperforms BASE (non-PLE).
>> On the PLE machine we do not see as great a performance improvement because PLE
>> complements halt().
>>
>> 3 guests with 8 VCPUs, 4GB RAM; 1 used for kernbench
>> (kernbench -f -H -M -o 20), the others for cpuhogs (shell script:
>> while true with an instruction)
>>
>> scenario A: unpinned
>>
>> 1x: no hogs
>> 2x: 8 hogs in one guest
>> 3x: 8 hogs each in two guests
>>
>> scenario B: unpinned, run kernbench on all the guests, no hogs.
>>
>> Dbench on PLE machine:
>> dbench run on all the guests simultaneously with
>> dbench --warmup=30 -t 120 with NRCLIENTS=(8/16/32).
>>
>> Result for Non PLE machine :
>> ============================
>> Machine : IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8 cores, 64GB RAM
>> BASE BASE+patch %improvement
>> mean (sd) mean (sd)
>> Scenario A:
>> case 1x: 164.233 (16.5506) 163.584 (15.4598) 0.39517
>> case 2x: 897.654 (543.993) 328.63 (103.771) 63.3901
>> case 3x: 2855.73 (2201.41) 315.029 (111.854) 88.9685
>>
>> Dbench:
>> Throughput is in MB/sec
>> NRCLIENTS BASE BASE+patch %improvement
>> mean (sd) mean (sd)
>> 8 1.774307 (0.061361) 1.725667 (0.034644) -2.74135
>> 16 1.445967 (0.044805) 1.463173 (0.094399) 1.18993
>> 32 2.136667 (0.105717) 2.193792 (0.129357) 2.67356
>>
>> Result for PLE machine:
>> ======================
>> Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32/64 cores, with 8
>> online cores and 4*64GB RAM
>>
>> Kernbench:
>> BASE BASE+patch %improvement
>> mean (sd) mean (sd)
>> Scenario A:
>> case 1x: 161.263 (56.518) 159.635 (40.5621) 1.00953
>> case 2x: 190.748 (61.2745) 190.606 (54.4766) 0.0744438
>> case 3x: 227.378 (100.215) 225.442 (92.0809) 0.851446
>>
>> Scenario B:
>> 446.104 (58.54 ) 433.12733 (54.476) 2.91
>>
>> Dbench:
>> Throughput is in MB/sec
>> NRCLIENTS BASE BASE+patch %improvement
>> mean (sd) mean (sd)
>> 8 1.101190 (0.875082) 1.700395 (0.846809) 54.4143
>> 16 1.524312 (0.120354) 1.477553 (0.058166) -3.06755
>> 32 2.143028 (0.157103) 2.090307 (0.136778) -2.46012
>
> So on a very contended system we're actually slower? Is this expected?
>
>
> Alex
>


Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 01/16/2012 08:40 AM, Jeremy Fitzhardinge wrote:
> >
> > That means we spin for n cycles, then notify the spinlock holder that we'd like to get kicked, and go to sleep. While I'm pretty sure that it improves the situation, it doesn't solve all of the issues we have.
> >
> > Imagine we have an idle host. All vcpus can freely run and everyone can fetch the lock as fast as on real machines. We don't need to / want to go to sleep here. Locks that take too long are bugs that need to be solved on real hw just as well, so all we do is possibly incur overhead.
>
> I'm not quite sure what your concern is. The lock is under contention, so there's nothing to do except spin; all this patch adds is a variable decrement/test to the spin loop, but that's not going to waste any more CPU than the non-counting case. And once it falls into the blocking path, it's a win because the VCPU isn't burning CPU any more.

The wakeup path is slower though. The previous lock holder has to
hypercall, and the new lock holder has to be scheduled, and transition
from halted state to running (a vmentry). So it's only a clear win if
we can do something with the cpu other than go into the idle loop.

> >
> > Imagine we have a contended host. Every vcpu gets at most 10% of a real CPU's runtime. So chances are 1:10 that you're actually running when you need to be. In such a setup, it's probably a good idea to be very pessimistic. Try to fetch the lock for 100 cycles and then immediately make room for all the other VMs that have real work going on!
>
> Are you saying the threshold should be dynamic depending on how loaded the system is? How can a guest know what the overall system contention is? How should a guest use that to work out a good spin time?
>
> One possibility is to use the ticket lock queue depth to work out how contended the lock is, and therefore how long it might be worth waiting for. I was thinking of something along the lines of "threshold = (THRESHOLD >> queue_depth)". But that's pure hand wave, and someone would actually need to experiment before coming up with something reasonable.
>
> But all of this is good to consider for future work, rather than being essential for the first version.

Agree.

> > So what I'm trying to get to is that if we had a hypervisor settable spin threshold, we could adjust it according to the host's load, getting VMs to behave differently on different (guest invisible) circumstances.
> >
> > Speaking of which - don't we have spin lock counters in the CPUs now? I thought we could set intercepts that notify us when the guest issues too many repz nops or whatever the typical spinlock identifier was. Can't we reuse that and just interrupt the guest if we see this with a special KVM interrupt that kicks off the internal spin lock waiting code? That way we don't slow down all those bare metal boxes.
>
> Yes, that mechanism exists, but it doesn't solve a very interesting problem.
>
> The most important thing to solve is making sure that when *releasing* a ticketlock, the correct next VCPU gets scheduled promptly. If you don't, you're just relying on the VCPU scheduler getting around to scheduling the correct VCPU, but if it doesn't it just ends up burning a timeslice of PCPU time while the wrong VCPU spins.

kvm does a directed yield to an unscheduled vcpu, selected in a round
robin fashion. So if your overload factor is N (N runnable vcpus for
every physical cpu), and your spin counter waits for S cycles before
exiting, you will burn N*S cycles (actually more, since there is overhead
involved, but let's fold it into S).
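As a worked example of this estimate (the numbers here are just illustrative): with an overload factor of N = 10 and a spin budget of S = 2048 cycles, each contended acquisition burns on the order of 20k cycles before the right vcpu runs.

```c
#include <assert.h>

/*
 * Cycles burned per contended acquisition in this model: N runnable
 * vcpus per physical cpu, each spinning S cycles before exiting, so
 * roughly N*S cycles pass before the correct vcpu is scheduled
 * (overhead folded into S, as noted above).
 */
static unsigned long long burned_cycles(unsigned overload_n,
                                        unsigned long long spin_s)
{
    return (unsigned long long)overload_n * spin_s;
}
```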

> Limiting the spin time with a timeout or the rep/nop interrupt somewhat mitigates this, but it still means you end up spending a lot of time slices spinning the wrong VCPU until it finally schedules the correct one. And the more contended the machine is, the worse the problem gets.

Right.

>
> > Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal? Last time I checked, enabling all the PV ops did incur significant slowdown, which is why I went through the work to split the individual pv ops features up to only enable a few for KVM guests.
>
> The whole point of the pv-ticketlock work is to keep the pvops hooks out of the locking fast path, so that the calls are only made on the slow path - that is, when spinning too long on a contended lock, and when releasing a lock that's in a "slow" state. In the fast path case of no contention, there are no pvops, and the executed code path is almost identical to native.
>
> But as I mentioned above, I'd like to see some benchmarks to prove that's the case.
>

--
error compiling committee.c: too many arguments to function


Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 01/16/2012 09:27 AM, Alexander Graf wrote:
>
[...]
>> Result for PLE machine:
>> ======================
>> Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32/64 cores, with 8
>> online cores and 4*64GB RAM
>>
>> Kernbench:
>> BASE BASE+patch %improvement
>> mean (sd) mean (sd)
>> Scenario A:
>> case 1x: 161.263 (56.518) 159.635 (40.5621) 1.00953
>> case 2x: 190.748 (61.2745) 190.606 (54.4766) 0.0744438
>> case 3x: 227.378 (100.215) 225.442 (92.0809) 0.851446
>>
>> Scenario B:
>> 446.104 (58.54 ) 433.12733 (54.476) 2.91
>>
>> Dbench:
>> Throughput is in MB/sec
>> NRCLIENTS BASE BASE+patch %improvement
>> mean (sd) mean (sd)
>> 8 1.101190 (0.875082) 1.700395 (0.846809) 54.4143
>> 16 1.524312 (0.120354) 1.477553 (0.058166) -3.06755
>> 32 2.143028 (0.157103) 2.090307 (0.136778) -2.46012
>
> So on a very contended system we're actually slower? Is this expected?
>
>

I think the result is interesting because it's a PLE machine. I have to
experiment more with the parameters: SPIN_THRESHOLD, and maybe also ple_gap
and ple_window.

> Alex
>
>


Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
* Alexander Graf <agraf@suse.de> [2012-01-16 04:57:45]:

> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal?

You mean, run kernel on bare metal with CONFIG_PARAVIRT_SPINLOCKS
enabled and compare how it performs with CONFIG_PARAVIRT_SPINLOCKS disabled for
some workload(s)?

In some sense, the 1x overcommit case results posted do measure the overhead
of (pv-)spinlocks, no? We don't see any overhead in that case, at least for
kernbench.

> Result for Non PLE machine :
> ============================

[snip]

> Kernbench:
> BASE BASE+patch
> %improvement
> mean (sd) mean (sd)
> Scenario A:
> case 1x: 164.233 (16.5506) 163.584 (15.4598) 0.39517

[snip]

> Result for PLE machine:
> ======================

[snip]
> Kernbench:
> BASE BASE+patch
> %improvement
> mean (sd) mean (sd)
> Scenario A:
> case 1x: 161.263 (56.518) 159.635 (40.5621) 1.00953

- vatsa


Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 16.01.2012, at 15:20, Srivatsa Vaddagiri wrote:

> * Alexander Graf <agraf@suse.de> [2012-01-16 04:57:45]:
>
>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal?
>
> You mean, run kernel on bare metal with CONFIG_PARAVIRT_SPINLOCKS
> enabled and compare how it performs with CONFIG_PARAVIRT_SPINLOCKS disabled for
> some workload(s)?

Yup

>
> In some sense, the 1x overcommit case results posted do measure the overhead
> of (pv-)spinlocks, no? We don't see any overhead in that case, at least for
> kernbench.
>
>> Result for Non PLE machine :
>> ============================
>
> [snip]
>
>> Kernbench:
>> BASE BASE+patch

What is BASE really? Is BASE already with the PV spinlocks enabled? I'm having a hard time understanding which tree you're working against, since the prerequisites aren't upstream yet.


Alex

>> %improvement
>> mean (sd) mean (sd)
>> Scenario A:
>> case 1x: 164.233 (16.5506) 163.584 (15.4598) 0.39517
>
> [snip]
>
>> Result for PLE machine:
>> ======================
>
> [snip]
>> Kernbench:
>> BASE BASE+patch
>> %improvement
>> mean (sd) mean (sd)
>> Scenario A:
>> case 1x: 161.263 (56.518) 159.635 (40.5621) 1.00953
>
> - vatsa
>


Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 01/16/2012 07:53 PM, Alexander Graf wrote:
>
> On 16.01.2012, at 15:20, Srivatsa Vaddagiri wrote:
>
>> * Alexander Graf<agraf@suse.de> [2012-01-16 04:57:45]:
>>
>>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal?
>>
>> You mean, run kernel on bare metal with CONFIG_PARAVIRT_SPINLOCKS
>> enabled and compare how it performs with CONFIG_PARAVIRT_SPINLOCKS disabled for
>> some workload(s)?
>
> Yup
>
>>
>> In some sense, the 1x overcommit case results posted do measure the overhead
>> of (pv-)spinlocks, no? We don't see any overhead in that case, at least for
>> kernbench.
>>
>>> Result for Non PLE machine :
>>> ============================
>>
>> [snip]
>>
>>> Kernbench:
>>> BASE BASE+patch
>
> What is BASE really? Is BASE already with the PV spinlocks enabled? I'm having a hard time understanding which tree you're working against, since the prerequisites aren't upstream yet.
>
>
> Alex

Sorry for the confusion; I think I was a little imprecise about BASE.

BASE is pre-3.2.0 + Jeremy's following patches:
xadd (https://lkml.org/lkml/2011/10/4/328)
x86/ticketlocklock (https://lkml.org/lkml/2011/10/12/496).
So it has Jeremy's ticketlock cleanups, with
CONFIG_PARAVIRT_SPINLOCKS=y.

BASE+patch = pre-3.2.0 + Jeremy's above patches + the above V5 PV
spinlock series, with CONFIG_PARAVIRT_SPINLOCKS=y.

In both cases, CONFIG_PARAVIRT_SPINLOCKS=y.

So let,
A. pre-3.2.0 with CONFIG_PARAVIRT_SPINLOCKS = n
B. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = n
C. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = y
D. pre-3.2.0 + Jeremy's above patches + V5 patches with
CONFIG_PARAVIRT_SPINLOCKS = n
E. pre-3.2.0 + Jeremy's above patches + V5 patches with
CONFIG_PARAVIRT_SPINLOCKS = y

Is it the performance of A vs. E? (Currently it is C vs. E.)

Please let me know the configuration expected for testing.

Jeremy,
I would be happy to test A vs B vs C vs E, for some workload of interest
if you wish, for your upcoming patches.

Thanks and Regards
Raghu

>
>>> %improvement
>>> mean (sd) mean (sd)
>>> Scenario A:
>>> case 1x: 164.233 (16.5506) 163.584 (15.4598) 0.39517
>>
>> [snip]
>>
>>> Result for PLE machine:
>>> ======================
>>
>> [snip]
>>> Kernbench:
>>> BASE BASE+patch
>>> %improvement
>>> mean (sd) mean (sd)
>>> Scenario A:
>>> case 1x: 161.263 (56.518) 159.635 (40.5621) 1.00953
>>
>> - vatsa
>>
>
>


Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 16.01.2012, at 19:38, Raghavendra K T wrote:

> On 01/16/2012 07:53 PM, Alexander Graf wrote:
>>
>> On 16.01.2012, at 15:20, Srivatsa Vaddagiri wrote:
>>
>>> * Alexander Graf<agraf@suse.de> [2012-01-16 04:57:45]:
>>>
>>>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal?
>>>
>>> You mean, run kernel on bare metal with CONFIG_PARAVIRT_SPINLOCKS
>>> enabled and compare how it performs with CONFIG_PARAVIRT_SPINLOCKS disabled for
>>> some workload(s)?
>>
>> Yup
>>
>>>
>>> In some sense, the 1x overcommit case results posted do measure the overhead
>>> of (pv-)spinlocks, no? We don't see any overhead in that case, at least for
>>> kernbench.
>>>
>>>> Result for Non PLE machine :
>>>> ============================
>>>
>>> [snip]
>>>
>>>> Kernbench:
>>>> BASE BASE+patch
>>
>> What is BASE really? Is BASE already with the PV spinlocks enabled? I'm having a hard time understanding which tree you're working against, since the prerequisites aren't upstream yet.
>>
>>
>> Alex
>
> Sorry for the confusion; I think I was a little imprecise about BASE.
>
> The BASE is pre 3.2.0 + Jeremy's following patches:
> xadd (https://lkml.org/lkml/2011/10/4/328)
> x86/ticketlocklock (https://lkml.org/lkml/2011/10/12/496).
> So this would have ticketlock cleanups from Jeremy and
> CONFIG_PARAVIRT_SPINLOCKS=y
>
> BASE+patch = pre 3.2.0 + Jeremy's above patches + above V5 PV spinlock
> series and CONFIG_PARAVIRT_SPINLOCKS=y
>
> In both cases, CONFIG_PARAVIRT_SPINLOCKS=y.
>
> So let,
> A. pre-3.2.0 with CONFIG_PARAVIRT_SPINLOCKS = n
> B. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = n
> C. pre-3.2.0 + Jeremy's above patches with CONFIG_PARAVIRT_SPINLOCKS = y
> D. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = n
> E. pre-3.2.0 + Jeremy's above patches + V5 patches with CONFIG_PARAVIRT_SPINLOCKS = y
>
> Is it the performance of A vs. E? (Currently it is C vs. E.)

Since D and E only matter with KVM in use, yes, I'm mostly interested in A, B and C :).


Alex


Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 01/16/2012 07:55 PM, Avi Kivity wrote:
> On 01/16/2012 08:40 AM, Jeremy Fitzhardinge wrote:
>>> That means we're spinning for n cycles, then notify the spinlock holder that we'd like to get kicked and go sleeping. While I'm pretty sure that it improves the situation, it doesn't solve all of the issues we have.
>>>
>>> Imagine we have an idle host. All vcpus can freely run and everyone can fetch the lock as fast as on real machines. We don't need to / want to go to sleep here. Locks that take too long are bugs that need to be solved on real hw just as well, so all we do is possibly incur overhead.
>> I'm not quite sure what your concern is. The lock is under contention, so there's nothing to do except spin; all this patch adds is a variable decrement/test to the spin loop, but that's not going to waste any more CPU than the non-counting case. And once it falls into the blocking path, its a win because the VCPU isn't burning CPU any more.
> The wakeup path is slower though. The previous lock holder has to
> hypercall, and the new lock holder has to be scheduled, and transition
> from halted state to running (a vmentry). So it's only a clear win if
> we can do something with the cpu other than go into the idle loop.

Not burning power is a win too.

Actually what you want is something like "if you preempt a VCPU while
it's spinning in a lock, then push it into the slowpath and don't
reschedule it without a kick". But I think that interface would have a
lot of fiddly corners.

J

Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 01/16/2012 09:24 PM, Alexander Graf wrote:
> This is true in case you're spinning. If, on overcommit, spinlocks would
> just yield() instead of spinning, we wouldn't have any vcpu running that's
> just waiting for a late ticket.

Yes, but the reality is that most spinlocks are held for a short period
of time and there's a low likelihood of being preempted while within a
spinlock critical section. Therefore if someone else tries to get the
spinlock and there's contention, it's always worth spinning for a little
while because the lock will likely become free soon.

At least that's the case if the lock has low contention (shallow queue
depth and not in slow state). Again, maybe it makes sense to never spin
for deep queues or locks already in the slow state.

> We still have an issue finding the point in time when a vcpu could run again, which is what this whole series is about. My point above was that instead of doing a count loop, we could just do the normal spin dance and set the threshold to when we enable the magic to have another spin lock notify us in the CPU. That way we
>
> * don't change the uncontended case

I don't follow you. What do you mean by "the normal spin dance"? What
do you mean by "have another spinlock notify us in the CPU"? Don't
change which uncontended case? Do you mean in the locking path? Or the
unlock path? Or both?

> * can set the threshold on the host, which knows how contended the system is

Hm, I'm not convinced that knowing how contended the system is is all
that useful overall. What's important is how contended a particular
lock is, and what state the current holder is in. If it's not currently
running, then knowing the overall system contention would give you some
idea about how long you need to wait for it to be rescheduled, but
that's getting pretty indirect.

I think the "slowpath if preempted while spinning" idea I mentioned in
the other mail is probably worth following up, since that gives specific
actionable information to the guest from the hypervisor. But lots of
caveats.

[[
A possible mechanism:

* register ranges of [er]ips with the hypervisor
* each range is paired with a "resched handler block"
* if vcpu is preempted within such a range, make sure it is
rescheduled in the resched handler block

This is obviously akin to the exception mechanism, but it is partially
implemented by the hypervisor. It allows the spinlock code to be
unchanged from native, while making use of a resched rather than an explicit
counter to determine when to slowpath the lock. And it's a nice general
mechanism that could be potentially useful elsewhere.

Unfortunately, it doesn't change the unlock path at all; it still needs
to explicitly test if a VCPU needs to be kicked on unlock.
]]


> And since we control what spin locks look like, we can for example always keep the pointer to it in a specific register so that we can handle pv_lock_ops.lock_spinning() inside there and fetch all the information we need from our pt_regs.

You've left a pile of parts of an idea lying around, but I'm not sure
what shape you intend it to be.

>>> Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal? Last time I checked, enabling all the PV ops did incur significant slowdown which is why I went though the work to split the individual pv ops features up to only enable a few for KVM guests.
>> The whole point of the pv-ticketlock work is to keep the pvops hooks out of the locking fast path, so that the calls are only made on the slow path - that is, when spinning too long on a contended lock, and when releasing a lock that's in a "slow" state. In the fast path case of no contention, there are no pvops, and the executed code path is almost identical to native.
> You're still changing a tight loop that does nothing (CPU detects it and saves power) into something that performs calculations.

It still has a "pause" instruction in that loop, so that CPU mechanism
will still come into play. "pause" doesn't directly "save power"; it's
more about making sure that memory dependence cycles are broken and that
two competing threads will make similar progress. Besides, I'm not sure
that adding a dec+test to a loop that's already got a memory read and
compare in it adds much in the way of "calculations".

J

Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 01/17/2012 05:29 AM, Jeremy Fitzhardinge wrote:
> On 01/16/2012 07:55 PM, Avi Kivity wrote:
>> On 01/16/2012 08:40 AM, Jeremy Fitzhardinge wrote:
>>>> That means we're spinning for n cycles, then notify the spinlock holder that we'd like to get kicked and go sleeping. While I'm pretty sure that it improves the situation, it doesn't solve all of the issues we have.
>>>>
>>>> Imagine we have an idle host. All vcpus can freely run and everyone can fetch the lock as fast as on real machines. We don't need to / want to go to sleep here. Locks that take too long are bugs that need to be solved on real hw just as well, so all we do is possibly incur overhead.
>>> I'm not quite sure what your concern is. The lock is under contention, so there's nothing to do except spin; all this patch adds is a variable decrement/test to the spin loop, but that's not going to waste any more CPU than the non-counting case. And once it falls into the blocking path, its a win because the VCPU isn't burning CPU any more.
>> The wakeup path is slower though. The previous lock holder has to
>> hypercall, and the new lock holder has to be scheduled, and transition
>> from halted state to running (a vmentry). So it's only a clear win if
>> we can do something with the cpu other than go into the idle loop.
>
> Not burning power is a win too.
>
> Actually what you want is something like "if you preempt a VCPU while
> it's spinning in a lock, then push it into the slowpath and don't
> reschedule it without a kick". But I think that interface would have a
> lot of fiddly corners.
>

Yes, the wakeup path is a little slower, but that's still better than
burning CPU, no?

Suppose we have 16 vcpus:
vcpu 1 - lock holder (preempted).
vcpus 2-8 - in the slowpath.

If the scheduler schedules vcpu 1, that is the most favourable outcome
for lock progress. But if vcpus 9-16, or something else, get scheduled,
it's still a win (we are doing some useful work), though lock progress
is again a little slower.

The optimization areas of interest are perhaps:
(1) Suppose vcpu 1 is running and is about to release the lock, and the
next vcpu in the queue has just gone to halt(). This pushes us to tune
SPIN_THRESHOLD correctly, and to have a mechanism to determine whether
the lock holder is running and, if so, to keep spinning. Identifying
whether the lock holder is running is the easier task, and can be the
next step of optimization.

(2) The much-discussed one: identifying lock-holder preemption (which
vcpu holds the lock) and doing a yield_to().

But I am not sure how complicated the yield_to() implementation is once
we have identified the exact preempted vcpu (the lock holder).

> J
>
>

