Mailing List Archive

[PATCH] Fix softlockup issue after vcpu hotplug
Stamp the softlockup thread earlier, before do_timer, because the
latter is what actually triggers the lockup warning after a long
offline period. Otherwise I observed the softlockup warning easily on
manual vcpu hot-remove/plug, or when a suspend is cancelled back into
the old context.

One point here is to cover both stolen and blocked time when comparing
against the offline threshold. vcpu hotplug falls into the 'stolen'
case, but that alone is not enough: since the Xen time model is
tickless at idle, a very long block time may be requested, which also
trips the softlockup watchdog.
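
In outline, the idea is something like this (a hand-written sketch, not
the patch itself; the threshold value and helper name are illustrative
only):

    /* Sketch: called from the Xen timer_interrupt path with the
     * per-vcpu stolen and blocked deltas for this interrupt.  If the
     * vcpu was away longer than the softlockup period, re-stamp the
     * watchdog so the stale timestamp cannot fire a warning. */
    #define OFFLINE_THRESHOLD_NS (5ULL * NSEC_PER_SEC)  /* illustrative */

    static void account_vcpu_offline_time(u64 stolen, u64 blocked)
    {
            /* vcpu hotplug shows up as stolen time; a long tickless
             * block shows up as blocked time -- either one can exceed
             * the watchdog period, so both must touch the stamp. */
            if (stolen + blocked > OFFLINE_THRESHOLD_NS)
                    touch_softlockup_watchdog();
    }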

Signed-off-by: Kevin Tian <kevin.tian@intel.com>

Thanks,
Kevin
Re: [PATCH] Fix softlockup issue after vcpu hotplug
On 30/1/07 08:26, "Tian, Kevin" <kevin.tian@intel.com> wrote:

> Stamp the softlockup thread earlier, before do_timer, because the
> latter is what actually triggers the lockup warning after a long
> offline period. Otherwise I observed the softlockup warning easily on
> manual vcpu hot-remove/plug, or when a suspend is cancelled back into
> the old context.

Actually the softlockup check is triggered from run_local_timers() which is
called very near the end of timer_interrupt(). So the existing location for
stamping the softlockup thread should be fine.

> One point here is to cover both stolen and blocked time when comparing
> against the offline threshold. vcpu hotplug falls into the 'stolen'
> case, but that alone is not enough: since the Xen time model is
> tickless at idle, a very long block time may be requested, which also
> trips the softlockup watchdog.

Every vcpu has a softlockup thread which regularly sleeps for some short
period. If the vcpu sets a timeout beyond that sleep time then we have a
bug. We shouldn't need to take into account blocked time -- Xen already
ensures that wakeup latency is accounted as stolen time. Blocked time only
includes time which the vcpu was willing to give up because it had no work
to do.

-- Keir



RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>Sent: 30 January 2007 17:38
>
>On 30/1/07 08:26, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>> Stamp the softlockup thread earlier, before do_timer, because the
>> latter is what actually triggers the lockup warning after a long
>> offline period. Otherwise I observed the softlockup warning easily on
>> manual vcpu hot-remove/plug, or when a suspend is cancelled back into
>> the old context.
>
>Actually the softlockup check is triggered from run_local_timers()
>which is called very near the end of timer_interrupt(). So the
>existing location for stamping the softlockup thread should be fine.

Yep, you're right. For this part I was looking at an old source tree. :-(

>
>> One point here is to cover both stolen and blocked time when
>> comparing against the offline threshold. vcpu hotplug falls into the
>> 'stolen' case, but that alone is not enough: since the Xen time model
>> is tickless at idle, a very long block time may be requested, which
>> also trips the softlockup watchdog.
>
>Every vcpu has a softlockup thread which regularly sleeps for some
>short period. If the vcpu sets a timeout beyond that sleep time then
>we have a bug. We shouldn't need to take into account blocked time --
>Xen already ensures that wakeup latency is accounted as stolen time.
>Blocked time only includes time which the vcpu was willing to give up
>because it had no work to do.
>

If we don't take blocked time into account, maybe we have to disable
the softlockup check. Say the idle process gets a timeout value larger
than 10s from next_timer_interrupt and then blocks. If, unfortunately,
no other event arrives before that timeout expires, this vcpu will see
a softlockup warning immediately afterwards, since this period is not
categorized as stolen time.

For example, when I hot-remove and then hot-plug a vcpu on domU by:

echo 0 > /sys/devices/system/cpu/cpu3/online
echo 1 > /sys/devices/system/cpu/cpu3/online

After cpu3 is up, the idle process sometimes gets a big timeout value
(0x40000000) from next_timer_interrupt. The virtual timer for that vcpu
is then disabled, and the vcpu itself blocks. Some time later (more
than 10s), another event (like an IPI) may wake this vcpu. In that
case, without including blocked time, I think it is difficult to
prevent the softlockup warning.

Another simple way to trigger the warning is to have __xen_suspend()
jump to smp_resume immediately after smp_suspend, as a test case for
suspend cancel. You can then observe all vcpus except vcpu0 hitting
that warning frequently.

Thanks,
Kevin

Re: [PATCH] Fix softlockup issue after vcpu hotplug
On 30/1/07 09:54, "Tian, Kevin" <kevin.tian@intel.com> wrote:

> If we don't take blocked time into account, maybe we have to disable
> the softlockup check. Say the idle process gets a timeout value larger
> than 10s from next_timer_interrupt and then blocks. If, unfortunately,
> no other event arrives before that timeout expires, this vcpu will see
> a softlockup warning immediately afterwards, since this period is not
> categorized as stolen time.

Presumably softlockup threads are killed and re-created when VCPUs are
offlined and onlined. Perhaps the re-creation is taking a long time? But 10s
would be a *very* long time. And once it is created and bound to the correct
VCPU we should never see long timeouts when blocking (since softlockup
thread timeout is never longer than a few seconds).

Perhaps there is a bug in our cpu onlining code -- a big timeout like that
does need investigating. I don't think we can claim this bug is root-caused
yet so it's premature to be applying patches.

-- Keir



Re: [PATCH] Fix softlockup issue after vcpu hotplug
On 30/1/07 09:54, "Tian, Kevin" <kevin.tian@intel.com> wrote:

> Another simple way to trigger the warning is to have __xen_suspend()
> jump to smp_resume immediately after smp_suspend, as a test case for
> suspend cancel. You can then observe all vcpus except vcpu0 hitting
> that warning frequently.

Do you know if this problem has been observed across many versions of Xen or
e.g., only after the upgrade to 2.6.18?

-- Keir



RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>Sent: 30 January 2007 18:09
>
>On 30/1/07 09:54, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>> If we don't take blocked time into account, maybe we have to disable
>> the softlockup check. Say the idle process gets a timeout value
>> larger than 10s from next_timer_interrupt and then blocks. If,
>> unfortunately, no other event arrives before that timeout expires,
>> this vcpu will see a softlockup warning immediately afterwards, since
>> this period is not categorized as stolen time.
>
>Presumably softlockup threads are killed and re-created when VCPUs
>are offlined and onlined. Perhaps the re-creation is taking a long
>time? But

That should not be the case, since the softlockup warning continues
to pop up after the cpu is brought online.

>10s would be a *very* long time. And once it is created and bound to
>the correct VCPU we should never see long timeouts when blocking
>(since softlockup thread timeout is never longer than a few seconds).

Yeah, I noted this point just after sending out the mail.

>
>Perhaps there is a bug in our cpu onlining code -- a big timeout like
>that does need investigating. I don't think we can claim this bug is
>root-caused yet so it's premature to be applying patches.
>

Agreed. I'll investigate this point further. I just quickly compared
the watchdog thread between 2.6.18 and 2.6.16. In 2.6.16 an explicit
1s schedule timeout is used, while 2.6.18 wakes the watchdog thread
once per second from the timer interrupt (softlockup_tick). One
distinct difference is that the 2.6.16 watchdog thread has a soft
timer registered while the 2.6.18 one does not. I suspect this may
affect the value next_timer_interrupt computes.
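
To illustrate the difference, the two loops look roughly like this
(paraphrased from memory of the two trees, so treat the exact calls as
approximate):

    /* 2.6.16-style watchdog thread: sleeping via msleep arms a soft
     * timer, so the timer wheel always holds a ~1s entry for this cpu. */
    while (!kthread_should_stop()) {
            msleep_interruptible(1000);     /* registers a timer on the wheel */
            touch_softlockup_watchdog();
    }

    /* 2.6.18-style watchdog thread: sleeps with no timer armed and
     * relies on softlockup_tick(), run from the timer interrupt, to
     * wake it.  An idle cpu's wheel therefore looks empty to
     * next_timer_interrupt(). */
    while (!kthread_should_stop()) {
            set_current_state(TASK_INTERRUPTIBLE);
            touch_softlockup_watchdog();
            schedule();     /* woken by wake_up_process() from the tick */
    }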

By the way, do you think the scheduler may do something to punish a
newly-onlined vcpu? Just from the code I didn't see that, since a
newly-woken vcpu is always boosted... In practice, however, I found
that the virtual timer interrupt count for that cpu increased only
slowly in 'cat /proc/interrupts'. Sometimes it may even freeze for
dozens of seconds. But yes, this may be the symptom rather than the
cause. :-)

Thanks,
Kevin

RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>Sent: 30 January 2007 18:10
>
>
>On 30/1/07 09:54, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>> Another simple way to trigger the warning is to have __xen_suspend()
>> jump to smp_resume immediately after smp_suspend, as a test case for
>> suspend cancel. You can then observe all vcpus except vcpu0 hitting
>> that warning frequently.
>
>Do you know if this problem has been observed across many versions of
>Xen or e.g., only after the upgrade to 2.6.18?
>
> -- Keir

Don't know yet. I just found this issue when adding lightweight
suspend: the softlockup warning pops up immediately after resuming back
to the old context. I then tried manual cpu hotplug, with the same
result. I'll try an old 2.6.16 version for comparison.

Thanks,
Kevin

RE: [PATCH] Fix softlockup issue after vcpu hotplug
Actually I'm quite interested in this case, where the watchdog thread
depends on the timer interrupt to be woken, while the next timer
interval depends on the soft timer wheel. For a newly-onlined cpu, all
the processes previously running there were migrated elsewhere before
it went offline. So when it first comes back online, there may be no
meaningful timers on its wheel and little activity on that vcpu. In
this case, (LONG_MAX >> 1) may be returned as a huge timeout.

So under this new watchdog model, simply walking the timer wheel is
not enough. Maybe we can cap the timeout at 1s in safe_halt to handle
this special case? I'll give it a try, though it will make the current
tickless model a bit tickful again. :-)
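
Something along these lines (a sketch only, assuming a
stop_hz_timer()-style idle path that programs Xen's one-shot timer from
next_timer_interrupt(); the 1s cap is the value discussed above):

    /* Cap the tickless sleep so a freshly-onlined vcpu with an empty
     * timer wheel still ticks often enough to run its watchdog. */
    unsigned long j = next_timer_interrupt();

    if (time_after(j, jiffies + HZ))    /* empty wheel => j can be huge */
            j = jiffies + HZ;

    /* ...then program the Xen singleshot timer for jiffies_to_st(j). */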

Thanks,
Kevin


>From: Tian Kevin
>Sent: 30 January 2007 20:12
>
>>
>>Perhaps there is a bug in our cpu onlining code -- a big timeout like
>>that does need investigating. I don't think we can claim this bug is
>>root-caused yet so it's premature to be applying patches.
>>
>
>Agreed. I'll investigate this point further. I just quickly compared
>the watchdog thread between 2.6.18 and 2.6.16. In 2.6.16 an explicit
>1s schedule timeout is used, while 2.6.18 wakes the watchdog thread
>once per second from the timer interrupt (softlockup_tick). One
>distinct difference is that the 2.6.16 watchdog thread has a soft
>timer registered while the 2.6.18 one does not. I suspect this may
>affect the value next_timer_interrupt computes.
>
>By the way, do you think the scheduler may do something to punish a
>newly-onlined vcpu? Just from the code I didn't see that, since a
>newly-woken vcpu is always boosted... In practice, however, I found
>that the virtual timer interrupt count for that cpu increased only
>slowly in 'cat /proc/interrupts'. Sometimes it may even freeze for
>dozens of seconds. But yes, this may be the symptom rather than the
>cause. :-)
>
>Thanks,
>Kevin
>

Re: [PATCH] Fix softlockup issue after vcpu hotplug
On 30/1/07 12:45 pm, "Tian, Kevin" <kevin.tian@intel.com> wrote:

> Actually I'm quite interested in this case, where the watchdog thread
> depends on the timer interrupt to be woken, while the next timer
> interval depends on the soft timer wheel. For a newly-onlined cpu, all
> the processes previously running there were migrated elsewhere before
> it went offline. So when it first comes back online, there may be no
> meaningful timers on its wheel and little activity on that vcpu. In
> this case, (LONG_MAX >> 1) may be returned as a huge timeout.

Yeah, but the thread should get migrated back again (or recreated) in fairly
short order. I think we can agree it should take rather less than 10
seconds. :-)

> So under this new watchdog model, simply walking the timer wheel is
> not enough. Maybe we can cap the timeout at 1s in safe_halt to handle
> this special case? I'll give it a try, though it will make the current
> tickless model a bit tickful again. :-)

I'm sure this will fix the issue. But who knows what real underlying issue
it might be hiding?

-- Keir



Re: [PATCH] Fix softlockup issue after vcpu hotplug
On 30/1/07 12:11 pm, "Tian, Kevin" <kevin.tian@intel.com> wrote:

>> Presumably softlockup threads are killed and re-created when VCPUs
>> are offlined and onlined. Perhaps the re-creation is taking a long
>> time? But
>
> That should not be the case, since the softlockup warning continues
> to pop up after the cpu is brought online.

You are confusing the two parts of the softlockup mechanism. The thread is
responsible only for periodically touching the watchdog. The warning
mechanism is driven off the timer interrupt handler. So it is entirely
possible for warnings to appear when the thread does not exist (in fact, if
the thread does not exist then we *expect* warnings to appear!).
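
For reference, the two halves in 2.6.18's kernel/softlockup.c look
roughly like this (condensed; locking, early-boot checks and the stack
dump are omitted):

    /* Half one: the per-cpu thread.  Its only job is to stamp the
     * per-cpu timestamp whenever it gets to run. */
    static int watchdog(void *__bind_cpu)
    {
            while (!kthread_should_stop()) {
                    set_current_state(TASK_INTERRUPTIBLE);
                    touch_softlockup_watchdog();    /* timestamp = jiffies */
                    schedule();
            }
            return 0;
    }

    /* Half two: the checker, driven from run_local_timers() in the
     * tick path.  It both wakes the thread and raises the warning, so
     * a missing or starved thread *causes* warnings. */
    void softlockup_tick(void)
    {
            int this_cpu = smp_processor_id();
            unsigned long touch_timestamp = per_cpu(touch_timestamp, this_cpu);

            if (time_after(jiffies, touch_timestamp + HZ))
                    wake_up_process(per_cpu(watchdog_task, this_cpu));

            if (time_after(jiffies, touch_timestamp + 10 * HZ))
                    printk(KERN_ERR "BUG: soft lockup detected on CPU#%d!\n",
                           this_cpu);
    }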

-- Keir



Re: [PATCH] Fix softlockup issue after vcpu hotplug
On 30/1/07 12:57 pm, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:

>> So under this new watchdog model, simply walking the timer wheel is
>> not enough. Maybe we can cap the timeout at 1s in safe_halt to handle
>> this special case? I'll give it a try, though it will make the
>> current tickless model a bit tickful again. :-)
>
> I'm sure this will fix the issue. But who knows what real underlying issue
> it might be hiding?

There could be a bug in next_timer_event(), for example. Maybe events a long
way out (multiple seconds) don't always get considered but we are normally
saved by the fact that CPUs have a few sooner events also queued up. But
that may not be the case for a newly-onlined CPU.

This is just an example hypothesis to explain why we need to properly track
this down.

-- Keir



RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>Sent: 30 January 2007 20:57
>
>On 30/1/07 12:45 pm, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>> Actually I'm quite interested in this case, where the watchdog
>> thread depends on the timer interrupt to be woken, while the next
>> timer interval depends on the soft timer wheel. For a newly-onlined
>> cpu, all the processes previously running there were migrated
>> elsewhere before it went offline. So when it first comes back online,
>> there may be no meaningful timers on its wheel and little activity on
>> that vcpu. In this case, (LONG_MAX >> 1) may be returned as a huge
>> timeout.
>
>Yeah, but the thread should get migrated back again (or recreated) in
>fairly short order. I think we can agree it should take rather less
>than 10 seconds. :-)

My test is on an 'idle' domain which does nothing. In this case I'm
not sure whether processes other than the per-cpu kernel threads will
be migrated back while one cpu can still easily handle them. The
per-cpu kernel threads, yes, will be re-created, but will they be
woken within 10s to do anything when there's no meaningful workload on
that cpu? This bug may actually not show up when the domain is under
heavy load...

>
>> So under this new watchdog model, simply walking the timer wheel is
>> not enough. Maybe we can cap the timeout at 1s in safe_halt to handle
>> this special case? I'll give it a try, though it will make the
>> current tickless model a bit tickful again. :-)
>
>I'm sure this will fix the issue. But who knows what real underlying issue
>it might be hiding?
>
> -- Keir

I'm not sure whether it hides something. But the current situation
seems like a self-made trap to me: the watchdog expects the timer
interrupt to wake it at 1s intervals, while the timer interrupt
deliberately schedules a longer interval without considering the
watchdog, and then blames the watchdog thread for not running within
10s. :-)

Thanks,
Kevin

RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>Sent: 30 January 2007 20:59
>On 30/1/07 12:11 pm, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>>> Presumably softlockup threads are killed and re-created when VCPUs
>>> are offlined and onlined. Perhaps the re-creation is taking a long
>>> time? But
>>
>> That should not be the case, since the softlockup warning continues
>> to pop up after the cpu is brought online.
>
>You are confusing the two parts of the softlockup mechanism. The
>thread is responsible only for periodically touching the watchdog. The
>warning mechanism is driven off the timer interrupt handler. So it is
>entirely possible for warnings to appear when the thread does not
>exist (in fact, if the thread does not exist then we *expect* warnings
>to appear!).
>
> -- Keir

I added a debug print inside the warning:

	printk(KERN_ERR "BUG: drift by 0x%lx\n",
	       jiffies - touch_timestamp);

This drift doesn't increase monotonically. Most of the time it is about
1s (an interesting fact!), and occasionally dozens of seconds. But
anyway, it indicates that the watchdog thread is still being
scheduled. :-)

Thanks,
Kevin

Re: [PATCH] Fix softlockup issue after vcpu hotplug
On 30/1/07 1:09 pm, "Tian, Kevin" <kevin.tian@intel.com> wrote:

>> I'm sure this will fix the issue. But who knows what real underlying issue
>> it might be hiding?
>>
>> -- Keir
>
> I'm not sure whether it hides something. But the current situation
> seems like a self-made trap to me: the watchdog expects the timer
> interrupt to wake it at 1s intervals, while the timer interrupt
> deliberately schedules a longer interval without considering the
> watchdog, and then blames the watchdog thread for not running within
> 10s. :-)

Actually I think you're right -- if this fixes the issue then it points to a
problem in the next_timer_event code. So it would actually be interesting to
try clamping the timeout to one second.

-- Keir



RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Tian Kevin
>Sent: 30 January 2007 21:12
>>You are confusing the two parts of the softlockup mechanism. The
>>thread is responsible only for periodically touching the watchdog.
>>The warning mechanism is driven off the timer interrupt handler. So
>>it is entirely possible for warnings to appear when the thread does
>>not exist (in fact, if the thread does not exist then we *expect*
>>warnings to appear!).
>>
>> -- Keir
>
>I added a debug print inside the warning:
>
>	printk(KERN_ERR "BUG: drift by 0x%lx\n",
>	       jiffies - touch_timestamp);
>
>This drift doesn't increase monotonically. Most of the time it is
>about 1s (an interesting fact!), and occasionally dozens of seconds.
>But anyway, it indicates that the watchdog thread is still being
>scheduled. :-)
>
>Thanks,
>Kevin
>

Sorry, I meant 'about 10s most of the time' there.

Thanks,
Kevin

RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>Sent: 30 January 2007 21:13
>On 30/1/07 1:09 pm, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>>> I'm sure this will fix the issue. But who knows what real
>>> underlying issue it might be hiding?
>>>
>>> -- Keir
>>
>> I'm not sure whether it hides something. But the current situation
>> seems like a self-made trap to me: the watchdog expects the timer
>> interrupt to wake it at 1s intervals, while the timer interrupt
>> deliberately schedules a longer interval without considering the
>> watchdog, and then blames the watchdog thread for not running within
>> 10s. :-)
>
>Actually I think you're right -- if this fixes the issue then it
>points to a problem in the next_timer_event code. So it would actually
>be interesting to try clamping the timeout to one second.
>
> -- Keir

By a simple change like this:

@@ -962,7 +962,8 @@ u64 jiffies_to_st(unsigned long j)
 	} else if (((unsigned long)delta >> (BITS_PER_LONG-3)) != 0) {
 		/* Very long timeout means there is no pending timer.
 		 * We indicate this to Xen by passing zero timeout. */
-		st = 0;
+		//st = 0;
+		st = processed_system_time + HZ * (u64)NS_PER_TICK;
 	} else {
 		st = processed_system_time + delta * (u64)NS_PER_TICK;
 	}

I really expected to call this the root fix, but I can't, even though
the change made things better. I created a domU with 4 VCPUs on a 2-CPU
box and tried hot-removing/plugging vcpus 1, 2 and 3 alternately. After
about ten rounds of testing everything was fine. Several minutes later,
however, I saw the warning again, though far less frequently than
before.

So I have to dig further into this bug. The first thing I plan to do is
determine whether such a long timeout is requested by the guest itself,
or whether Xen enlarges the timeout underneath... :-(

BTW, do you think it's worth destroying the vcpu in the scheduler when
it goes down, and re-initializing it in the scheduler when it comes
back up? I don't know whether this would affect the scheduler's
accounting. Domain save/restore doesn't actually show this bug, and one
obvious difference compared to vcpu hotplug is that the domain is
restored into a new context...

Thanks,
Kevin

P.S. Some trace log is attached. You can see that the drift in each
warning is just around 1000 ticks.
[root@localhost ~]# BUG: soft lockup detected on CPU#1!
BUG: drift by 0x41e
[<c0151301>] softlockup_tick+0xd1/0x100
[<c01095d4>] timer_interrupt+0x4e4/0x640
[<c011bbae>] try_to_wake_up+0x24e/0x300
[<c0151c89>] handle_IRQ_event+0x59/0xa0
[<c0151d65>] __do_IRQ+0x95/0x120
[<c010708f>] do_IRQ+0x3f/0xa0
[<c0103070>] xen_idle+0x0/0x60
[<c024e355>] evtchn_do_upcall+0xb5/0x120
[<c0103070>] xen_idle+0x0/0x60
[<c01057a5>] hypervisor_callback+0x3d/0x48
[<c0103070>] xen_idle+0x0/0x60
[<c0109d40>] raw_safe_halt+0x20/0x50
[<c01030a1>] xen_idle+0x31/0x60
[<c010316e>] cpu_idle+0x9e/0xf0
BUG: soft lockup detected on CPU#2!
BUG: drift by 0x447
[<c0151301>] softlockup_tick+0xd1/0x100
[<c01095d4>] timer_interrupt+0x4e4/0x640
[<c011bbae>] try_to_wake_up+0x24e/0x300
[<c0151c89>] handle_IRQ_event+0x59/0xa0
[<c0151d65>] __do_IRQ+0x95/0x120
[<c010708f>] do_IRQ+0x3f/0xa0
[<c0103070>] xen_idle+0x0/0x60
[<c024e355>] evtchn_do_upcall+0xb5/0x120
[<c0103070>] xen_idle+0x0/0x60
[<c01057a5>] hypervisor_callback+0x3d/0x48
[<c0103070>] xen_idle+0x0/0x60
[<c0109d40>] raw_safe_halt+0x20/0x50
[<c01030a1>] xen_idle+0x31/0x60
[<c010316e>] cpu_idle+0x9e/0xf0
BUG: soft lockup detected on CPU#1!
BUG: drift by 0x43f
[<c0151301>] softlockup_tick+0xd1/0x100
[<c01095d4>] timer_interrupt+0x4e4/0x640
[<c011bbae>] try_to_wake_up+0x24e/0x300
[<c0151c89>] handle_IRQ_event+0x59/0xa0
[<c0151d65>] __do_IRQ+0x95/0x120
[<c010708f>] do_IRQ+0x3f/0xa0
[<c0103070>] xen_idle+0x0/0x60
[<c024e355>] evtchn_do_upcall+0xb5/0x120
[<c0103070>] xen_idle+0x0/0x60
[<c01057a5>] hypervisor_callback+0x3d/0x48
[<c0103070>] xen_idle+0x0/0x60
[<c0109d40>] raw_safe_halt+0x20/0x50
[<c01030a1>] xen_idle+0x31/0x60
[<c010316e>] cpu_idle+0x9e/0xf0
BUG: soft lockup detected on CPU#1!
BUG: drift by 0x3ea
[<c0151301>] softlockup_tick+0xd1/0x100
[<c01095d4>] timer_interrupt+0x4e4/0x640
[<c0137699>] __rcu_process_callbacks+0x99/0x100
[<c0129867>] tasklet_action+0x87/0x130
[<c0151c89>] handle_IRQ_event+0x59/0xa0
[<c0151d65>] __do_IRQ+0x95/0x120
[<c010708f>] do_IRQ+0x3f/0xa0
[<c0103070>] xen_idle+0x0/0x60
[<c024e355>] evtchn_do_upcall+0xb5/0x120
[<c0103070>] xen_idle+0x0/0x60
[<c01057a5>] hypervisor_callback+0x3d/0x48
[<c0103070>] xen_idle+0x0/0x60
[<c0109d40>] raw_safe_halt+0x20/0x50
[<c01030a1>] xen_idle+0x31/0x60
[<c010316e>] cpu_idle+0x9e/0xf0

Re: [PATCH] Fix softlockup issue after vcpu hotplug
On 30/1/07 2:11 pm, "Tian, Kevin" <kevin.tian@intel.com> wrote:

> BTW, do you think it's worth destroying the vcpu in the scheduler
> when it goes down, and re-initializing it in the scheduler when it
> comes back up? I don't know whether this would affect the scheduler's
> accounting. Domain save/restore doesn't actually show this bug, and
> one obvious difference compared to vcpu hotplug is that the domain is
> restored into a new context...

I wouldn't expect this to make any significant difference to scheduling
accounting, certainly over a multi-second time period.

Does the length of time you hot-unplug the vcpu for make a difference
to how often you see this problem? Did you try repro'ing with a 2.6.16
kernel?

-- Keir



RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>Sent: 30 January 2007 22:23
>
>On 30/1/07 2:11 pm, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>> BTW, do you think it's worth destroying the vcpu in the scheduler
>> when it goes down, and re-initializing it in the scheduler when it
>> comes back up? I don't know whether this would affect the scheduler's
>> accounting. Domain save/restore doesn't actually show this bug, and
>> one obvious difference compared to vcpu hotplug is that the domain is
>> restored into a new context...
>
>I wouldn't expect this to make any significant difference to scheduling
>accounting, certainly over a multi-second time period.
>
>Does the length of time you hot-unplug the vcpu for make a difference
>to how often you see this problem? Did you try repro'ing with a 2.6.16
>kernel?
>
> -- Keir

I can't tell, since with manual operation I didn't use the same pace
in each round. I tried both plugging immediately after unplugging and
waiting longer than 10s. But the first warning popped up just as I
finished the test and was ready to send out the 'good' news. :-(

I'll repro on a 2.6.16 kernel tomorrow, because the remote box just
crashed.

Thanks,
Kevin

RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Tian, Kevin
>Sent: 30 January 2007 22:34
>>I wouldn't expect this to make any significant difference to
>>scheduling accounting, certainly over a multi-second time period.
>>
>>Does the length of time you hot-unplug the vcpu for make a difference
>>to how often you see this problem? Did you try repro'ing with a 2.6.16
>>kernel?
>>
>> -- Keir
>
>I can't tell, since with manual operation I didn't use the same pace
>in each round. I tried both plugging immediately after unplugging and
>waiting longer than 10s. But the first warning popped up just as I
>finished the test and was ready to send out the 'good' news. :-(
>
>I'll repro on a 2.6.16 kernel tomorrow, because the remote box just
>crashed.
>
>Thanks,
>Kevin

I have to say the previous change was incomplete, because it only
limits the timeout to 1s in the very-long-timeout case
(BITS_PER_LONG-3), and excludes the cases in the middle. I should apply
the 1s limit on all branches, in case a timeout that is not very long
but still longer than 10s is hit. Due to the crashed box, however, I'll
have to verify this tomorrow too.
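
i.e. clamping delta itself so every branch honors the cap -- roughly
like this (untested, against the xen-sparse jiffies_to_st()):

    u64 jiffies_to_st(unsigned long j)
    {
            unsigned long seq;
            long delta;
            u64 st;

            do {
                    seq = read_seqbegin(&xtime_lock);
                    delta = j - jiffies;
                    if (delta < 1)
                            delta = 1;      /* already due: expire next tick */
                    else if (delta > HZ)
                            delta = HZ;     /* cap *every* branch at 1s, not
                                             * just the empty-wheel case */
                    st = processed_system_time + delta * (u64)NS_PER_TICK;
            } while (read_seqretry(&xtime_lock, seq));

            return st;
    }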

Thanks,
Kevin

RE: [PATCH] Fix softlockup issue after vcpu hotplug
> On 30/1/07 09:54, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > Another simple way to trigger the warning is to have
> > __xen_suspend() jump to smp_resume immediately after smp_suspend,
> > as a test case for suspend cancel. You can then observe all vcpus
> > except vcpu0 hitting that warning frequently.
>
> Do you know if this problem has been observed across many versions of
> Xen or e.g., only after the upgrade to 2.6.18?
>

I'm not sure, but I think we've been seeing something very similar
when live migrating domains with 3.0.3/2.6.16.29 -- my understanding is
that the live migration code takes the domain down to UP, does the
migration and then restores SMP -- and we VERY often see soft lockup
messages following this (several times per night in our regression
testing), with stack traces identical to those posted by Kevin.

I also added some instrumentation and in every single case, the 'stolen'
time is > 5s when we see the soft lockup.

Simon


RE: [PATCH] Fix softlockup issue after vcpu hotplug
>From: Graham, Simon [mailto:Simon.Graham@stratus.com]
>Sent: 31 January 2007 3:29
>> On 30/1/07 09:54, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>
>> > Another simple way to trigger the warning is to have
>> > __xen_suspend() jump to smp_resume immediately after smp_suspend,
>> > as a test case for suspend cancel. You can then observe all vcpus
>> > except vcpu0 hitting that warning frequently.
>>
>> Do you know if this problem has been observed across many versions of
>> Xen or e.g., only after the upgrade to 2.6.18?
>>
>
>I'm not sure, but I think we've been seeing something very similar
>when live migrating domains with 3.0.3/2.6.16.29 -- my understanding is
>that the live migration code takes the domain down to UP, does the
>migration and then restores SMP -- and we VERY often see soft lockup
>messages following this (several times per night in our regression
>testing), with stack traces identical to those posted by Kevin.
>
>I also added some instrumentation and in every single case, the
>'stolen' time is > 5s when we see the soft lockup.
>
>Simon

Hi Simon,
Your case should be different from what I saw; it may be fixed by the
original patch I posted, which however doesn't apply to the latest
tree. In the 2.6.16 version it's do_timer that calls softlockup_tick
instead of run_local_timers, so the "stolen > 5s" check there is a bit
too late: the warning has already popped up even though the timestamp
is adjusted afterwards. Could you try the attached patch to see whether
it fixes your live migration case?

Thanks,
Kevin
RE: [PATCH] Fix softlockup issue after vcpu hotplug
> Hi Simon,
> Your case should be different from what I saw; it may be fixed by the
> original patch I posted, which however doesn't apply to the latest
> tree. In the 2.6.16 version it's do_timer that calls softlockup_tick
> instead of run_local_timers, so the "stolen > 5s" check there is a bit
> too late: the warning has already popped up even though the timestamp
> is adjusted afterwards. Could you try the attached patch to see
> whether it fixes your live migration case?
>

Thanks - that explains why the original patch didn't work! I will try
this out and see how it goes.

Simon

RE: [PATCH] Fix softlockup issue after vcpu hotplug
Kevin,

>
> Hi Simon,
> Your case should be different from what I saw; it may be fixed by the
> original patch I posted, which however doesn't apply to the latest
> tree. In the 2.6.16 version it's do_timer that calls softlockup_tick
> instead of run_local_timers, so the "stolen > 5s" check there is a bit
> too late: the warning has already popped up even though the timestamp
> is adjusted afterwards. Could you try the attached patch to see
> whether it fixes your live migration case?
>

So, I tried this last night - I don't see any problems following live
migration, but I am still seeing soft lockups, all of which are related
to cases where there is a large stolen value. I haven't looked at all
the logs yet, but I did see a couple of things:

1. There were a ton of occasions when the test for stolen > 5s fired
   but the value of stolen was actually negative - is a -ve stolen
   value expected? I think the patch needs to be modified to define
   stolen_threshold as s64 instead of u64 if this is expected...

2. Following save/restore, I see absolutely massive positive values of
   stolen, of the order of the time the domain was saved (seems
   reasonable), but then I immediately see a soft lockup even though we
   touched the watchdog. Shouldn't this patch also fix soft lockup
   after save/restore?

3. I actually saw a bunch of cases where there was a mongo stolen value
   during apparently normal operation (in the ones I've looked at, the
   system as a whole was not particularly stressed); I need to work on
   exactly why the domain is not being scheduled, but in the meantime,
   shouldn't this patch stop the incorrect soft lockup in DomU when the
   hypervisor fails to schedule the domain for a long period? (not
   exactly related to VCPU hotplug, I know)

Simon

Re: [PATCH] Fix softlockup issue after vcpu hotplug
On 1/2/07 14:31, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> 3. I actually saw a bunch of cases where there was a mongo stolen
> value during apparently normal operation (in the ones I've looked at,
> the system as a whole was not particularly stressed); I need to work
> on exactly why the domain is not being scheduled, but in the meantime,
> shouldn't this patch stop the incorrect soft lockup in DomU when the
> hypervisor fails to schedule the domain for a long period? (not
> exactly related to VCPU hotplug, I know)

No, the patch that Kevin provided cannot work because it touches the
watchdog before jiffies has been updated. Since both the jiffy update
and the watchdog check happen inside do_timer(), this is a hard problem
to fix for Linux 2.6.16. You could push the watchdog touch inside the
loop that calls do_timer(): I think that would work!
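
Roughly like this, I mean (an untested sketch of the 2.6.16 xen-sparse
loop; `stolen_exceeded_threshold' stands for whatever test the patch
already computes):

    /* Each iteration advances jiffies by one tick inside do_timer(),
     * which also runs the softlockup check.  Touching the watchdog
     * inside the loop means every check sees a fresh timestamp. */
    while (delta >= NS_PER_TICK) {
            delta -= NS_PER_TICK;
            processed_system_time += NS_PER_TICK;
            if (stolen_exceeded_threshold)
                    touch_softlockup_watchdog();    /* stamp = current jiffies */
            do_timer(regs);     /* updates jiffies, then checks watchdog */
    }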

-- Keir



RE: [PATCH] Fix softlockup issue after vcpu hotplug
> No, the patch that Kevin provided cannot work because it touches the
> watchdog before jiffies has been updated. Since both the jiffy update
> and the watchdog check happen inside do_timer(), this is a hard
> problem to fix for Linux 2.6.16. You could push the watchdog touch
> inside the loop that calls do_timer(): I think that would work!
>

Thanks Keir -- I think it's time I moved to 3.0.4 and the later kernel!
Once I've done that, I'll get back to seeing if this issue still exists.

Simon

1 2  View All