Mailing List Archive

1 2  View All
RE: [PATCH] Fix softlockup issue after vcpu hotplug [ In reply to ]
>
> No, the patch that Kevin provided cannot work because it touches the
> watchdog before jiffies has been updated. Since both jiffy update and
> watchdog check happens inside do_timer(), this is a hard problem to
fix
> for
> Linux 2.6.16. You could push the watchdog touch inside the following
> loop
> that calls do_timer(): I think that would work!
>

OK, I've spent a little time to really understand this today
(hopefully!) and I think I know now why none of the patches to date (for
2.6.16 anyway) work -- the problem is they only touched the wdt one time
BUT timer_interrupt in time-xen.c has a loop that repeatedly calls
do_timer to advance the jiffies and check for timeout until the entire
delta time since the last time called is accounted for... any single one
of those do_timer calls might result in a watchdog timer expiration.

It's also not really correct to only touch the watchdog if the stolen
time is > 5s -- you might be currently sitting at 8s since the watchdog
was last updated and get called after 2s of stolen time and that will
cause a timeout.

What's more, if you get called with more than 20s of stolen time (e.g.
after save/restore or pause/unpause), you really need to tickle the
watchdog timer multiple times (at least once for every 10s worth of
jiffies in the total stolen time).

So -- my proposal (patch attached for 2.6.16) is to touch the watchdog
inside the loop that calls do_timer(), right after the call IF the
remaining amount of stolen time is greater than NS_PER_TICK -- since
each call to do_timer advances jiffies by one, this could only go wrong
if there was only a single jiffy left until the watchdog timer expires
on entry and I think that's OK!

I also considered only touching the watchdog timer every 5s or so, but I
think the code to do that would have more overhead than simply touching
it for every do_timer() call (since it's just a call that copies jiffies
to the per-cpu watchdog timer value).

Take a look and let me know what you think (the printk could be removed
-- I just put it in so I could tell the code was running).

Simon
Re: [PATCH] Fix softlockup issue after vcpu hotplug [ In reply to ]
On 1/2/07 22:40, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> I also considered only touching the watchdog timer every 5s or so, but I
> think the code to do that would have more overhead than simply touching
> it for every do_timer() call (since it's just a call that copies jiffies
> to the per-cpu watchdog timer value).
>
> Take a look and let me know what you think (the printk could be removed
> -- I just put it in so I could tell the code was running).

The test inside the loop should check against NS_PER_TICK*100, not just 0.
You only want to override the usual running of the watchdog if you get a big
bunch of time stolen from you. Actually, five seconds (NS_PER_TICK*5*HZ)
would be good: no reason to make the comparison dependent on the duration of
a jiffy.

-- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
RE: [PATCH] Fix softlockup issue after vcpu hotplug [ In reply to ]
> The test inside the loop should check against NS_PER_TICK*100, not
just
> 0.
> You only want to override the usual running of the watchdog if you get
> a big
> bunch of time stolen from you. Actually, five seconds
> (NS_PER_TICK*5*HZ)
> would be good: no reason to make the comparison dependent on the
> duration of
> a jiffy.
>

I thought about this - the problem is I don't know what the current
value of the watchdog is, so if stolen is greater than zero, I need to
do it once immediately and then once every 5s or so in the loop - I cant
just do it the first n times through the loop because then I might do
10s worth of jiffy updates following all the watchdog touches... (BTW -
the test for NS_PER_TICK*100 was just for the purposes of
instrumentation)

So - to move to a scheme where we only touch the watchdog every 5s of
simulated time, I'd need to track if it's been 5s since the last time I
did it... that would mean maintaining another variable to track when the
last time I updated the watchdog was and I thought this would actually
be more overhead than simply updating it everytime round the loop.

I do agree that the test should be against NS_PER_TICK rather than 0 -
I'll make that change.

If you really think it's bad to touch the watchdog on each loop, then
I'd suggest doing this I think:

int next_wd = 0;

/* System-wide jiffy work. */
while (delta >= NS_PER_TICK) {
delta -= NS_PER_TICK;
processed_system_time += NS_PER_TICK;
do_timer(regs);

if (adjust_watchdog >= NS_PER_TICK) {
if (next_wd == 0) {
/* Avoid lockup warnings */
touch_softlockup_watchdog();
next_wd = HZ*5; // Dont adjust again for
another 5s
}
else
next_wd--;

adjust_watchdog -= NS_PER_TICK;
}
}

Simon

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [PATCH] Fix softlockup issue after vcpu hotplug [ In reply to ]
On 1/2/07 23:44, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> I thought about this - the problem is I don't know what the current
> value of the watchdog is, so if stolen is greater than zero, I need to
> do it once immediately and then once every 5s or so in the loop - I cant
> just do it the first n times through the loop because then I might do
> 10s worth of jiffy updates following all the watchdog touches... (BTW -
> the test for NS_PER_TICK*100 was just for the purposes of
> instrumentation)

I don't mean to touch it only every 5s in the loop, I mean to touch it every
time round the loop but only if stolen is greater than five seconds:

while (delta >= NS_PER_TICK) {
...;
if (stolen > <five seconds>)
touch_softlockup_watchdog();
}

The point is that you don't want to touch the watchdog whenever you have
small amounts of time stolen from you because that will happen very often
(wakeup latencies, preemption) and cause the watchdog to not do its job
properly and/or in a timely fashion when something *does* go wrong.

If you touch it just about every time you enter the timer ISR you may as
well disable the softlockup mechanism altogether! :-)

The only theoretical problem with this approach is if you got time stolen
that accumulated to more than five seconds, but this happened in two or more
bursts, back-to-back. Then no one stolen period would be enough to trigger
the touch, but also the guest may not be running for long enough to schedule
the softlockup thread. I really don't believe this would be an issue ever in
practise however, given sane scheduling parameters and load on the system.
If the system were loaded/configured so it could happen, the guest would be
in dire straits for other reasons.

-- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
RE: [PATCH] Fix softlockup issue after vcpu hotplug [ In reply to ]
>From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>Sent: 2007Äê2ÔÂ2ÈÕ 2:25
>> 3. I actually saw a bunch of cases where there was a mongo stolen
>value
>> during apparently normal
>> operation (in the ones I've looked at, the system as a whole was
>not
>> particularly stressed); I
>> need to work on exactly why the domain is not being secheduled,
>but
>> in the meantime, shouldn't
>> this patch stop the incorrect soft lockup in DomU when the
>hypervisor
>> fails to schedule the
>> domain for a long period? (not exactly related to VCPU hotplug I
>> know)
>
>No, the patch that Kevin provided cannot work because it touches the
>watchdog before jiffies has been updated. Since both jiffy update and
>watchdog check happens inside do_timer(), this is a hard problem to fix
>for
>Linux 2.6.16. You could push the watchdog touch inside the following
>loop
>that calls do_timer(): I think that would work!
>
> -- Keir

Agree.

Thanks,
Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
RE: [PATCH] Fix softlockup issue after vcpu hotplug [ In reply to ]
>
> I don't mean to touch it only every 5s in the loop, I mean to touch it
> every
> time round the loop but only if stolen is greater than five seconds:
>

Ah right -- got it now; good point.

> The only theoretical problem with this approach is if you got time
> stolen
> that accumulated to more than five seconds, but this happened in two
or
> more
> bursts, back-to-back. Then no one stolen period would be enough to
> trigger
> the touch, but also the guest may not be running for long enough to
> schedule
> the softlockup thread. I really don't believe this would be an issue
> ever in
> practise however, given sane scheduling parameters and load on the
> system.
> If the system were loaded/configured so it could happen, the guest
> would be
> in dire straits for other reasons.

How about using a slightly smaller value like 1 or 2 s -- something
larger than the expected wakeup latency etc but small enough that it
would take multiple back-to-back bursts to hit 10s...

Simon

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
RE: [PATCH] Fix softlockup issue after vcpu hotplug [ In reply to ]
>From: Graham, Simon [mailto:Simon.Graham@stratus.com]
>Sent: 2007Äê2ÔÂ2ÈÕ 11:47
>> The only theoretical problem with this approach is if you got time
>> stolen
>> that accumulated to more than five seconds, but this happened in two
>or
>> more
>> bursts, back-to-back. Then no one stolen period would be enough to
>> trigger
>> the touch, but also the guest may not be running for long enough to
>> schedule
>> the softlockup thread. I really don't believe this would be an issue
>> ever in
>> practise however, given sane scheduling parameters and load on the
>> system.
>> If the system were loaded/configured so it could happen, the guest
>> would be
>> in dire straits for other reasons.
>
>How about using a slightly smaller value like 1 or 2 s -- something
>larger than the expected wakeup latency etc but small enough that it
>would take multiple back-to-back bursts to hit 10s...
>
>Simon

If you really concern this value, how about make it configurable with
a default value? Or even a boot option? Smaller the value is, the effect
of watchdog thread to check weird behavior is wakened. On the
contrary, bigger value enlarges possibility of accumulated stolen time.
However for now, anyway we have no concrete example to prove
how frequent back-to-back bursts may happen and whether
accumulated case does happen. Most likely that may happen in
some heavy loaded system with many vcpus. But in that case, maybe
scalability issue on other areas will jump out first. :-)

Thanks,
Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [PATCH] Fix softlockup issue after vcpu hotplug [ In reply to ]
On 2/2/07 03:47, "Graham, Simon" <Simon.Graham@stratus.com> wrote:

> How about using a slightly smaller value like 1 or 2 s -- something
> larger than the expected wakeup latency etc but small enough that it
> would take multiple back-to-back bursts to hit 10s...

The choice of 5s was hardly scientific. I would say a choice of between say
2s and 5s is just fine and unlikely to really affect likelihood of false
positives.

-- Keir



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

1 2  View All