
[xen-unstable test] 6374: regressions - FAIL
flight 6374 xen-unstable real [real]
http://www.chiark.greenend.org.uk/~xensrcts/logs/6374/

Regressions :-(

Tests which did not succeed and are blocking:
test-amd64-i386-pv 5 xen-boot fail REGR. vs. 6369

Tests which did not succeed, but are not blocking,
including regressions (tests previously passed) regarded as allowable:
test-amd64-amd64-win 16 leak-check/check fail never pass
test-amd64-amd64-xl-win 13 guest-stop fail never pass
test-amd64-i386-rhel6hvm-amd 8 guest-saverestore fail never pass
test-amd64-i386-rhel6hvm-intel 8 guest-saverestore fail never pass
test-amd64-i386-win-vcpus1 16 leak-check/check fail never pass
test-amd64-i386-win 16 leak-check/check fail never pass
test-amd64-i386-xl-credit2 9 guest-start fail like 6367
test-amd64-i386-xl-win-vcpus1 13 guest-stop fail never pass
test-amd64-xcpkern-i386-rhel6hvm-amd 8 guest-saverestore fail never pass
test-amd64-xcpkern-i386-rhel6hvm-intel 8 guest-saverestore fail never pass
test-amd64-xcpkern-i386-win 16 leak-check/check fail never pass
test-amd64-xcpkern-i386-xl-credit2 11 guest-localmigrate fail like 6369
test-amd64-xcpkern-i386-xl-win 13 guest-stop fail never pass
test-i386-i386-win 16 leak-check/check fail never pass
test-i386-i386-xl-win 13 guest-stop fail never pass
test-i386-xcpkern-i386-win 16 leak-check/check fail never pass

version targeted for testing:
xen 22cc047eb146
baseline version:
xen 6fa299ad15c8

------------------------------------------------------------
People who touched revisions under test:
Ian Campbell <ian.campbell@citrix.com>
Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich <jbeulich@novell.com>
Jim Fehlig <jfehlig@novell.com>
Liu, Jinsong <jinsong.liu@intel.com>
Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Wei Gang <gang.wei@intel.com>
------------------------------------------------------------

jobs:
build-i386-xcpkern pass
build-amd64 pass
build-i386 pass
build-amd64-oldkern pass
build-i386-oldkern pass
build-amd64-pvops pass
build-i386-pvops pass
test-amd64-amd64-xl pass
test-amd64-i386-xl pass
test-i386-i386-xl pass
test-amd64-xcpkern-i386-xl pass
test-i386-xcpkern-i386-xl pass
test-amd64-i386-rhel6hvm-amd fail
test-amd64-xcpkern-i386-rhel6hvm-amd fail
test-amd64-i386-xl-credit2 fail
test-amd64-xcpkern-i386-xl-credit2 fail
test-amd64-i386-rhel6hvm-intel fail
test-amd64-xcpkern-i386-rhel6hvm-intel fail
test-amd64-i386-xl-multivcpu pass
test-amd64-xcpkern-i386-xl-multivcpu pass
test-amd64-amd64-pair pass
test-amd64-i386-pair pass
test-i386-i386-pair pass
test-amd64-xcpkern-i386-pair pass
test-i386-xcpkern-i386-pair pass
test-amd64-amd64-pv pass
test-amd64-i386-pv fail
test-i386-i386-pv pass
test-amd64-xcpkern-i386-pv pass
test-i386-xcpkern-i386-pv pass
test-amd64-i386-win-vcpus1 fail
test-amd64-i386-xl-win-vcpus1 fail
test-amd64-amd64-win fail
test-amd64-i386-win fail
test-i386-i386-win fail
test-amd64-xcpkern-i386-win fail
test-i386-xcpkern-i386-win fail
test-amd64-amd64-xl-win fail
test-i386-i386-xl-win fail
test-amd64-xcpkern-i386-xl-win fail


------------------------------------------------------------
sg-report-flight on woking.cam.xci-test.com
logs: /home/xc_osstest/logs
images: /home/xc_osstest/images

Logs, config files, etc. are available at
http://www.chiark.greenend.org.uk/~xensrcts/logs

Test harness code can be found at
http://xenbits.xensource.com/gitweb?p=osstest.git;a=summary


Not pushing.

------------------------------------------------------------
changeset: 23020:22cc047eb146
tag: tip
user: Liu, Jinsong <jinsong.liu@intel.com>
date: Thu Mar 10 18:35:32 2011 +0000

x86: Fix cpuidle bug

Before entering C3, bus master disable / cache flush should be the
last step; after resuming from C3, bus master enable should be the
first step.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Acked-by: Wei Gang <gang.wei@intel.com>


changeset: 23019:c8947c24536a
user: Ian Campbell <ian.campbell@citrix.com>
date: Thu Mar 10 18:21:42 2011 +0000

libxl: do not rely on guest to respond when forcing pci device removal

This is consistent with the expected semantics of a forced device
removal and also avoids a delay when destroying an HVM domain which
either does not support hot unplug (does not respond to SCI) or has
crashed.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>


changeset: 23018:a46101334ee2
user: Jim Fehlig <jfehlig@novell.com>
date: Thu Mar 10 18:17:16 2011 +0000

libxl: Call setsid(2) before exec'ing device model

While doing development on the libvirt libxenlight driver I noticed
that terminating a libxenlight client causes any qemu-dm
processes that were indirectly created by the client to also
terminate. Calling setsid(2) before exec'ing qemu-dm resolves
the issue.

Signed-off-by: Jim Fehlig <jfehlig@novell.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>


changeset: 23017:b16644e446ef
user: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
date: Thu Mar 10 18:11:31 2011 +0000

update README

update README: we are missing a few compile-time dependencies and a
link to the pvops kernel page on the wiki.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>


changeset: 23016:6fa299ad15c8
user: Jan Beulich <jbeulich@novell.com>
date: Wed Mar 09 17:25:44 2011 +0000

x86: remove pre-686 CPU support bits

... as Xen doesn't run on such CPUs anyway. Clearly these bits were
particularly odd to have on x86-64.

Signed-off-by: Jan Beulich <jbeulich@novell.com>


(qemu changes not included)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Re: [xen-unstable test] 6374: regressions - FAIL
xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
> flight 6374 xen-unstable real [real]
> Tests which did not succeed and are blocking:
> test-amd64-i386-pv 5 xen-boot fail REGR. vs. 6369

Xen crash in scheduler (non-credit2).

Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre x86_64 debug=y Not tainted ]----
Mar 11 13:46:57.931763 (XEN) CPU: 1
Mar 11 13:46:57.931784 (XEN) RIP: e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047 CONTEXT: hypervisor
Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0 rbx: ffff8301a7fafc78 rcx: 0000000000000002
Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0 rsi: 0000000000000080 rdi: ffff8301a7fafc78
Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8 rsp: ffff8301a7fafc00 r8: 0000000000000002
Mar 11 13:46:57.966770 (XEN) r9: 0000ffff0000ffff r10: 00ff00ff00ff00ff r11: 0f0f0f0f0f0f0f0f
Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68 r13: 0000000000000001 r14: 0000000000000001
Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0 cr0: 000000008005003b cr4: 00000000000006f0
Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000 cr2: 00000000c45e5770
Mar 11 13:46:57.987800 (XEN) ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0000 cs: e008
Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
Mar 11 13:46:57.998802 (XEN) ffff82c480119557 00007cfe580503c7 ffff82c4802d1ac0 ffff82c4802d0cc0
Mar 11 13:46:58.010781 (XEN) ffff82c4802d1ad0 ffff82c4802d1ad0 ffff82c4802d1ad0 ffff82c4802d1ad0
Mar 11 13:46:58.019765 (XEN) 01ff8301a7fb3048 0000000000000800 0000000000000000 0000000000000100
Mar 11 13:46:58.019798 (XEN) 0000000000000000 0000000000000f02 0000000000000000 0000000000000f00
Mar 11 13:46:58.031777 (XEN) 0000000000000000 ffff8301a7fb3048 ffff8301a7fb3048 ffff8301a7eac048
Mar 11 13:46:58.039906 (XEN) 0000000000000001 ffff82c4802d0cc0 0000000000000000 ffff8301a7fafcc8
Mar 11 13:46:58.039930 (XEN) ffff82c480119582 ffff8301a7fafd28 ffff82c480122c8d 0000000100000001
Mar 11 13:46:58.051781 (XEN) ffff82c4802d0cc0 ffff82c4802d0cc0 ffff8300d7cdc000 0000000000000206
Mar 11 13:46:58.063769 (XEN) ffff8300d7cdc000 0000000000000001 ffff8300d7cdc000 000000018b4d75e5
Mar 11 13:46:58.063807 (XEN) ffff8301a7fb3040 ffff8301a7fafd48 ffff82c480122e24 ed543b2d00000000
Mar 11 13:46:58.075781 (XEN) ffff8300d7afc000 ffff8301a7fafe38 ffff82c480157f17 ffff82c480123dd4
Mar 11 13:46:58.087771 (XEN) ffff82c4802d0cc8 ffff8301a7fafe38 ffff82c480118c8a ffff82c4802d0cc0
Mar 11 13:46:58.098761 (XEN) ffff82c4802d0cc0 ffff82c4802d0cc0 ffff82c4802d0cc0 ffff82c4802d0cc0
Mar 11 13:46:58.098797 (XEN) 000000018b4d75e5 ffff8301a7fafe68 00000001a7e80e70 ffff8301a7ffa400
Mar 11 13:46:58.110773 (XEN) ffff8301a7ffaee8 ffff8301a7fafdf8 ffc08301a7ffaf90 0000000000000086
Mar 11 13:46:58.119760 (XEN) ffff8301a7fafdf8 ffff82c480123b91 0000000000000001 0000000000000000
Mar 11 13:46:58.119794 (XEN) 0000000000000000 ffff8301a7fafe38 ffff8300d7afc000 ffff8300d7cdc000
Mar 11 13:46:58.134790 (XEN) 0000000000000003 000000018b4d75e5 ffff8301a7fb3040 ffff8301a7fafeb8
Mar 11 13:46:58.139763 (XEN) ffff82c4801226b4 ffff8301a7fafe68 000000018b4d75e5 ffff8301a7fb3100
Mar 11 13:46:58.139804 (XEN) ffff8300d7cdc060 ffff8300d7afc000 ffffffffffffffff ffff8301a7faff00
Mar 11 13:46:58.154777 (XEN) Xen call trace:
Mar 11 13:46:58.154798 (XEN) [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
Mar 11 13:46:58.163767 (XEN) [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
Mar 11 13:46:58.163802 (XEN) [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
Mar 11 13:46:58.178768 (XEN) [<ffff82c480122e24>] context_saved+0x62/0x7b
Mar 11 13:46:58.178799 (XEN) [<ffff82c480157f17>] context_switch+0xd98/0xdca
Mar 11 13:46:58.183766 (XEN) [<ffff82c4801226b4>] schedule+0x5fc/0x624
Mar 11 13:46:58.183795 (XEN) [<ffff82c480123837>] __do_softirq+0x88/0x99
Mar 11 13:46:58.198784 (XEN) [<ffff82c4801238b2>] do_softirq+0x6a/0x7a
Mar 11 13:46:58.198817 (XEN)
Mar 11 13:46:58.198828 (XEN)
Mar 11 13:46:58.198839 (XEN) ****************************************
Mar 11 13:46:58.207765 (XEN) Panic on CPU 1:
Mar 11 13:46:58.207787 (XEN) FATAL TRAP: vector = 2 (nmi)
Mar 11 13:46:58.207813 (XEN) [error_code=0000] , IN INTERRUPT CONTEXT
Mar 11 13:46:58.218761 (XEN) ****************************************
Mar 11 13:46:58.218788 (XEN)
Mar 11 13:46:58.218802 (XEN) Reboot in five seconds...

Re: [xen-unstable test] 6374: regressions - FAIL
At 17:51 +0000 on 11 Mar (1299865912), Ian Jackson wrote:
> Mar 11 13:46:58.154777 (XEN) Xen call trace:
> Mar 11 13:46:58.154798 (XEN) [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
> Mar 11 13:46:58.163767 (XEN) [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
> Mar 11 13:46:58.163802 (XEN) [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
> Mar 11 13:46:58.178768 (XEN) [<ffff82c480122e24>] context_saved+0x62/0x7b
> Mar 11 13:46:58.178799 (XEN) [<ffff82c480157f17>] context_switch+0xd98/0xdca
> Mar 11 13:46:58.183766 (XEN) [<ffff82c4801226b4>] schedule+0x5fc/0x624
> Mar 11 13:46:58.183795 (XEN) [<ffff82c480123837>] __do_softirq+0x88/0x99
> Mar 11 13:46:58.198784 (XEN) [<ffff82c4801238b2>] do_softirq+0x6a/0x7a

I think this hang comes about because, although this code:

    cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
    if ( commit )
        CSCHED_PCPU(nxt)->idle_bias = cpu;
    cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));

removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
have been in cpus in the first place, and neither are its siblings,
since nxt might not be a sibling of cpu.

Possible fix:

diff -r b9a5d116102d xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Thu Mar 10 13:06:52 2011 +0000
+++ b/xen/common/sched_credit.c Mon Mar 14 09:25:07 2011 +0000
@@ -533,7 +533,7 @@ _csched_cpu_pick(const struct scheduler
             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
             if ( commit )
                 CSCHED_PCPU(nxt)->idle_bias = cpu;
-            cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
+            cpus_andnot(cpus, cpus, nxt_idlers);
         }
         else
         {

which guarantees that nxt will be removed from cpus, though I suspect
this means that we might not pick the best HT pair in a particular core.
Scheduler code is twisty and hurts my brain so I'd like George's
opinion before checking anything in.

Cheers,

Tim.

P.S. the patch above is a one-liner for clarity: a better fix would be:

diff -r b9a5d116102d xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Thu Mar 10 13:06:52 2011 +0000
+++ b/xen/common/sched_credit.c Mon Mar 14 09:26:11 2011 +0000
@@ -533,12 +533,8 @@ _csched_cpu_pick(const struct scheduler
             cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
             if ( commit )
                 CSCHED_PCPU(nxt)->idle_bias = cpu;
-            cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
         }
-        else
-        {
-            cpus_andnot(cpus, cpus, nxt_idlers);
-        }
+        cpus_andnot(cpus, cpus, nxt_idlers);
     }
 
     return cpu;



--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)

Re: [xen-unstable test] 6374: regressions - FAIL
>>> On 11.03.11 at 18:51, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
>> flight 6374 xen-unstable real [real]
>> Tests which did not succeed and are blocking:
>> test-amd64-i386-pv 5 xen-boot fail REGR. vs. 6369
>
> Xen crash in scheduler (non-credit2).
>
> Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
> Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre x86_64 debug=y Not tainted ]----
> Mar 11 13:46:57.931763 (XEN) CPU: 1
> Mar 11 13:46:57.931784 (XEN) RIP: e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
> Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047 CONTEXT: hypervisor
> Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0 rbx: ffff8301a7fafc78 rcx: 0000000000000002
> Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0 rsi: 0000000000000080 rdi: ffff8301a7fafc78
> Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8 rsp: ffff8301a7fafc00 r8: 0000000000000002
> Mar 11 13:46:57.966770 (XEN) r9: 0000ffff0000ffff r10: 00ff00ff00ff00ff r11: 0f0f0f0f0f0f0f0f
> Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68 r13: 0000000000000001 r14: 0000000000000001
> Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0 cr0: 000000008005003b cr4: 00000000000006f0
> Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000 cr2: 00000000c45e5770
> Mar 11 13:46:57.987800 (XEN) ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0000 cs: e008
> Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
>...
> Mar 11 13:46:58.154777 (XEN) Xen call trace:
> Mar 11 13:46:58.154798 (XEN) [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
> Mar 11 13:46:58.163767 (XEN) [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
> Mar 11 13:46:58.163802 (XEN) [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
> Mar 11 13:46:58.178768 (XEN) [<ffff82c480122e24>] context_saved+0x62/0x7b
> Mar 11 13:46:58.178799 (XEN) [<ffff82c480157f17>] context_switch+0xd98/0xdca
> Mar 11 13:46:58.183766 (XEN) [<ffff82c4801226b4>] schedule+0x5fc/0x624
> Mar 11 13:46:58.183795 (XEN) [<ffff82c480123837>] __do_softirq+0x88/0x99
> Mar 11 13:46:58.198784 (XEN) [<ffff82c4801238b2>] do_softirq+0x6a/0x7a

I suppose that's a result of 22957:c5c4688d5654 - as I understand it
exiting the loop is only possible if two consecutive invocations of
pick_cpu return the same result. This, however, is precisely what the
pCPU's idle_bias is supposed to prevent on hyper-threaded/multi-core
systems (so that it's not always the same entity that gets selected).

But even beyond that particular aspect, relying on any form of
"stability" of the returned value isn't correct.

Plus running pick_cpu repeatedly without actually using its result
is wrong with respect to idle_bias updating too - that's why
cached_vcpu_acct() calls _csched_cpu_pick() with the commit
argument set to false (which will result in a subsequent call -
through pick_cpu - with the argument set to true to be likely
to return the same value, but there's no correctness dependency
on that). So 22948:2d35823a86e7 already wasn't really correct
in putting a loop around pick_cpu.

It's also not clear to me what the surrounding
if ( old_lock == per_cpu(schedule_data, old_cpu).schedule_lock )
is supposed to filter, as the lock pointer gets set only when a
CPU gets brought up.

As I don't really understand what this code is trying to achieve,
I also can't really suggest a possible fix other than reverting both
offending changesets.

Jan


Re: [xen-unstable test] 6374: regressions - FAIL
>>> On 14.03.11 at 11:02, Tim Deegan <Tim.Deegan@citrix.com> wrote:
> At 17:51 +0000 on 11 Mar (1299865912), Ian Jackson wrote:
>> Mar 11 13:46:58.154777 (XEN) Xen call trace:
>> Mar 11 13:46:58.154798 (XEN) [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:58.163767 (XEN) [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
>> Mar 11 13:46:58.163802 (XEN) [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
>> Mar 11 13:46:58.178768 (XEN) [<ffff82c480122e24>] context_saved+0x62/0x7b
>> Mar 11 13:46:58.178799 (XEN) [<ffff82c480157f17>]
> context_switch+0xd98/0xdca
>> Mar 11 13:46:58.183766 (XEN) [<ffff82c4801226b4>] schedule+0x5fc/0x624
>> Mar 11 13:46:58.183795 (XEN) [<ffff82c480123837>] __do_softirq+0x88/0x99
>> Mar 11 13:46:58.198784 (XEN) [<ffff82c4801238b2>] do_softirq+0x6a/0x7a
>
> I think this hang comes because although this code:
>
> cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
> if ( commit )
> CSCHED_PCPU(nxt)->idle_bias = cpu;
> cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
>
> removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
> have been in cpus in the first place, and none of its siblings are
> either since nxt might not be its sibling.

I had originally spent quite a while to verify that the loop this is in
can't be infinite (i.e. there's going to be always at least one bit
removed from "cpus"), and did so again during the last half hour
or so. I'm certain (hardened also by the CPU masks we see on the
stack) that it's not this function itself that's looping infinitely, but
rather its caller (see my other reply sent just a few minutes ago).

> Possible fix:
>
> diff -r b9a5d116102d xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c Thu Mar 10 13:06:52 2011 +0000
> +++ b/xen/common/sched_credit.c Mon Mar 14 09:25:07 2011 +0000
> @@ -533,7 +533,7 @@ _csched_cpu_pick(const struct scheduler
> cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
> if ( commit )
> CSCHED_PCPU(nxt)->idle_bias = cpu;
> - cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
> + cpus_andnot(cpus, cpus, nxt_idlers);
> }
> else
> {
>
> which guarantees that nxt will be removed from cpus, though I suspect
> this means that we might not pick the best HT pair in a particular core.
> Scheduler code is twisty and hurts my brain so I'd like George's
> opinion before checking anything in.

No - that was done in precisely the opposite direction, to get
better symmetry of load across all CPUs. With what you propose,
idle_bias would become meaningless.

Jan


Re: [xen-unstable test] 6374: regressions - FAIL
At 10:39 +0000 on 14 Mar (1300099174), Jan Beulich wrote:
> > I think this hang comes because although this code:
> >
> > cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
> > if ( commit )
> > CSCHED_PCPU(nxt)->idle_bias = cpu;
> > cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
> >
> > removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
> > have been in cpus in the first place, and none of its siblings are
> > either since nxt might not be its sibling.
>
> I had originally spent quite a while to verify that the loop this is in
> can't be infinite (i.e. there's going to be always at least one bit
> removed from "cpus"), and did so again during the last half hour
> or so.

I'm pretty sure there are possible passes through this loop that don't
remove any cpus, though I haven't constructed the full history that gets
you there. But the cpupool patches you suggest in your other email look
like much stronger candidates for this hang.

> > which guarantees that nxt will be removed from cpus, though I suspect
> > this means that we might not pick the best HT pair in a particular core.
> > Scheduler code is twisty and hurts my brain so I'd like George's
> > opinion before checking anything in.
>
> No - that was precisely done the opposite direction to get
> better symmetry of load across all CPUs. With what you propose,
> idle_bias would become meaningless.

I don't see why it would. As I said, having picked a core we
might not iterate to pick the best cpu within that core, but the
round-robining effect is still there. And even if not, I figured a
hypervisor crash is worse than a suboptimal scheduling decision. :)

Tim.

--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)

Re: [xen-unstable test] 6374: regressions - FAIL
On 03/14/11 11:33, Jan Beulich wrote:
>>>> On 11.03.11 at 18:51, Ian Jackson<Ian.Jackson@eu.citrix.com> wrote:
>> xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
>>> flight 6374 xen-unstable real [real]
>>> Tests which did not succeed and are blocking:
>>> test-amd64-i386-pv 5 xen-boot fail REGR. vs. 6369
>>
>> Xen crash in scheduler (non-credit2).
>>
>> Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
>> Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre x86_64 debug=y Not tainted ]----
>> Mar 11 13:46:57.931763 (XEN) CPU: 1
>> Mar 11 13:46:57.931784 (XEN) RIP: e008:[<ffff82c480100140>] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047 CONTEXT: hypervisor
>> Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0 rbx: ffff8301a7fafc78 rcx: 0000000000000002
>> Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0 rsi: 0000000000000080 rdi: ffff8301a7fafc78
>> Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8 rsp: ffff8301a7fafc00 r8: 0000000000000002
>> Mar 11 13:46:57.966770 (XEN) r9: 0000ffff0000ffff r10: 00ff00ff00ff00ff r11: 0f0f0f0f0f0f0f0f
>> Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68 r13: 0000000000000001 r14: 0000000000000001
>> Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0 cr0: 000000008005003b cr4: 00000000000006f0
>> Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000 cr2: 00000000c45e5770
>> Mar 11 13:46:57.987800 (XEN) ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0000 cs: e008
>> Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
>> ...
>> Mar 11 13:46:58.154777 (XEN) Xen call trace:
>> Mar 11 13:46:58.154798 (XEN) [<ffff82c480100140>] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:58.163767 (XEN) [<ffff82c480119582>] csched_cpu_pick+0xe/0x10
>> Mar 11 13:46:58.163802 (XEN) [<ffff82c480122c8d>] vcpu_migrate+0xfb/0x230
>> Mar 11 13:46:58.178768 (XEN) [<ffff82c480122e24>] context_saved+0x62/0x7b
>> Mar 11 13:46:58.178799 (XEN) [<ffff82c480157f17>] context_switch+0xd98/0xdca
>> Mar 11 13:46:58.183766 (XEN) [<ffff82c4801226b4>] schedule+0x5fc/0x624
>> Mar 11 13:46:58.183795 (XEN) [<ffff82c480123837>] __do_softirq+0x88/0x99
>> Mar 11 13:46:58.198784 (XEN) [<ffff82c4801238b2>] do_softirq+0x6a/0x7a
>
> I suppose that's a result of 22957:c5c4688d5654 - as I understand it
> exiting the loop is only possible if two consecutive invocations of
> pick_cpu return the same result. This, however, is precisely what the
> pCPU's idle_bias is supposed to prevent on hyper-threaded/multi-core
> systems (so that it's not always the same entity that gets selected).
>
> But even beyond that particular aspect, relying on any form of
> "stability" of the returned value isn't correct.
>
> Plus running pick_cpu repeatedly without actually using its result
> is wrong wrt to idle_bias updating too - that's why
> cached_vcpu_acct() calls _csched_cpu_pick() with the commit
> argument set to false (which will result in a subsequent call -
> through pick_cpu - with the argument set to true to be likely
> to return the same value, but there's no correctness dependency
> on that). So 22948:2d35823a86e7 already wasn't really correct
> in putting a loop around pick_cpu.
>
> It's also not clear to me what the surrounding
> if ( old_lock == per_cpu(schedule_data, old_cpu).schedule_lock )
> is supposed to filter, as the lock pointer gets set only when a
> CPU gets brought up.

Yeah, but the vcpu can change cpus while we don't hold the lock.
This means old_cpu can change between selecting the lock and actually
taking it...

> As I don't really understand what is being tried to achieve here,
> I also can't really suggest a possible fix other than reverting both
> offending changesets.

I'll send a patch as a suggestion :-)


Juergen

--
Juergen Gross Principal Developer Operating Systems
TSP ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28 Internet: ts.fujitsu.com
D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html

Re: [xen-unstable test] 6374: regressions - FAIL
>>> On 14.03.11 at 11:52, Tim Deegan <Tim.Deegan@citrix.com> wrote:
> At 10:39 +0000 on 14 Mar (1300099174), Jan Beulich wrote:
>> > I think this hang comes because although this code:
>> >
>> > cpu = cycle_cpu(CSCHED_PCPU(nxt)->idle_bias, nxt_idlers);
>> > if ( commit )
>> > CSCHED_PCPU(nxt)->idle_bias = cpu;
>> > cpus_andnot(cpus, cpus, per_cpu(cpu_sibling_map, cpu));
>> >
>> > removes the new cpu and its siblings from cpus, cpu isn't guaranteed to
>> > have been in cpus in the first place, and none of its siblings are
>> > either since nxt might not be its sibling.
>>
>> I had originally spent quite a while to verify that the loop this is in
>> can't be infinite (i.e. there's going to be always at least one bit
>> removed from "cpus"), and did so again during the last half hour
>> or so.
>
> I'm pretty sure there are possible passes through this loop that don't
> remove any cpus, though I haven't constructed the full history that gets
> you there.

Actually, while I don't think that this can happen, something else is
definitely broken here: The logic can select a CPU that's not in the
vCPU's affinity mask. How I managed to not note this when I
originally put this change together I can't tell. I'll send a patch in
a moment, and I think after that patch it's also easier to see that
each iteration will remove at least one bit.

>> > which guarantees that nxt will be removed from cpus, though I suspect
>> > this means that we might not pick the best HT pair in a particular core.
>> > Scheduler code is twisty and hurts my brain so I'd like George's
>> > opinion before checking anything in.
>>
>> No - that was precisely done the opposite direction to get
>> better symmetry of load across all CPUs. With what you propose,
>> idle_bias would become meaningless.
>
> I don't think see why it would. As I said, having picked a core we
> might not iterate to pick the best cpu within that core, but the
> round-robining effect is still there. And even if not I figured a
> hypervisor crash is worse than a suboptimal scheduling decision. :)

Sure. Just that this code has been there for quite a long time, and
it would be really strange to only now see it start producing hangs
(which apparently aren't that difficult to reproduce - iirc a similar
one was sent around by Ian a few days earlier).

Jan


Re: [xen-unstable test] 6374: regressions - FAIL
At 16:08 +0000 on 14 Mar (1300118917), Jan Beulich wrote:
> Actually, while I don't think that this can happen, something else is
> definitely broken here: The logic can select a CPU that's not in the
> vCPU's affinity mask. How I managed to not note this when I
> originally put this change together I can't tell. I'll send a patch in
> a moment, and I think after that patch it's also easier to see that
> each iteration will remove at least one bit.

Yes, as long as the cpu selected has to be in "cpus", the loop is
definitely safe.

> Sure. Just that this code has been there for quite a long time, and
> it would be really strange to only now see it start producing hangs
> (which apparently aren't that difficult to reproduce - iirc a similar
> one was sent around by Ian a few days earlier).

Agreed; the other branch of this thread is clearly where this particular
hang is coming from.

Cheers,

Tim.

--
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
