[xen-unstable test] 164996: regressions - FAIL
flight 164996 xen-unstable real [real]
flight 165002 xen-unstable real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/164996/
http://logs.test-lab.xenproject.org/osstest/logs/165002/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
test-arm64-arm64-libvirt-raw 17 guest-start/debian.repeat fail REGR. vs. 164945

Tests which are failing intermittently (not blocking):
test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 165002-retest
test-amd64-i386-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail pass in 165002-retest

Tests which did not succeed, but are not blocking:
test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 164945
test-armhf-armhf-libvirt 16 saverestore-support-check fail like 164945
test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 164945
test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 164945
test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 164945
test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check fail like 164945
test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 164945
test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail like 164945
test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 164945
test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 164945
test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 164945
test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 164945
test-arm64-arm64-xl-seattle 15 migrate-support-check fail never pass
test-arm64-arm64-xl-seattle 16 saverestore-support-check fail never pass
test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
test-amd64-i386-libvirt 15 migrate-support-check fail never pass
test-amd64-i386-libvirt-xsm 15 migrate-support-check fail never pass
test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
test-amd64-i386-xl-pvshim 14 guest-start fail never pass
test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
test-arm64-arm64-xl-xsm 15 migrate-support-check fail never pass
test-arm64-arm64-xl-xsm 16 saverestore-support-check fail never pass
test-arm64-arm64-xl-credit2 15 migrate-support-check fail never pass
test-arm64-arm64-xl-credit2 16 saverestore-support-check fail never pass
test-arm64-arm64-xl-thunderx 15 migrate-support-check fail never pass
test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail never pass
test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail never pass
test-arm64-arm64-xl 15 migrate-support-check fail never pass
test-arm64-arm64-xl 16 saverestore-support-check fail never pass
test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
test-amd64-i386-libvirt-raw 14 migrate-support-check fail never pass
test-arm64-arm64-libvirt-raw 14 migrate-support-check fail never pass
test-arm64-arm64-libvirt-raw 15 saverestore-support-check fail never pass
test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
test-arm64-arm64-xl-vhd 14 migrate-support-check fail never pass
test-arm64-arm64-xl-vhd 15 saverestore-support-check fail never pass
test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
test-armhf-armhf-xl 15 migrate-support-check fail never pass
test-armhf-armhf-xl 16 saverestore-support-check fail never pass
test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
test-armhf-armhf-xl-credit1 15 migrate-support-check fail never pass
test-armhf-armhf-xl-credit1 16 saverestore-support-check fail never pass
test-armhf-armhf-xl-cubietruck 15 migrate-support-check fail never pass
test-armhf-armhf-xl-cubietruck 16 saverestore-support-check fail never pass
test-armhf-armhf-libvirt-qcow2 14 migrate-support-check fail never pass
test-armhf-armhf-libvirt-raw 14 migrate-support-check fail never pass
test-armhf-armhf-xl-vhd 14 migrate-support-check fail never pass
test-armhf-armhf-xl-vhd 15 saverestore-support-check fail never pass
test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
test-armhf-armhf-xl-credit2 15 migrate-support-check fail never pass
test-armhf-armhf-xl-credit2 16 saverestore-support-check fail never pass
test-armhf-armhf-xl-arndale 15 migrate-support-check fail never pass
test-armhf-armhf-xl-arndale 16 saverestore-support-check fail never pass
test-arm64-arm64-xl-credit1 15 migrate-support-check fail never pass
test-arm64-arm64-xl-credit1 16 saverestore-support-check fail never pass

version targeted for testing:
xen 487975df53b5298316b594550c79934d646701bd
baseline version:
xen c76cfada1cfad05aaf64ce3ad305c5467650e782

Last test of basis 164945 2021-09-10 21:23:48 Z 5 days
Failing since 164951 2021-09-12 00:14:36 Z 4 days 8 attempts
Testing same since 164996 2021-09-15 11:47:08 Z 0 days 1 attempts

------------------------------------------------------------
People who touched revisions under test:
Andrew Cooper <andrew.cooper3@citrix.com>
Daniel P. Smith <dpsmith@apertussolutions.com>
Ian Jackson <iwj@xenproject.org>
Jan Beulich <jbeulich@suse.com>
Nick Rosbrook <rosbrookn@ainfosec.com>
Penny Zheng <penny.zheng@arm.com>
Roger Pau Monne <roger.pau@citrix.com>
Roger Pau Monné <roger.pau@citrix.com>
Stefano Stabellini <stefano.stabellini@xilinx.com>

jobs:
build-amd64-xsm pass
build-arm64-xsm pass
build-i386-xsm pass
build-amd64-xtf pass
build-amd64 pass
build-arm64 pass
build-armhf pass
build-i386 pass
build-amd64-libvirt pass
build-arm64-libvirt pass
build-armhf-libvirt pass
build-i386-libvirt pass
build-amd64-prev pass
build-i386-prev pass
build-amd64-pvops pass
build-arm64-pvops pass
build-armhf-pvops pass
build-i386-pvops pass
test-xtf-amd64-amd64-1 pass
test-xtf-amd64-amd64-2 pass
test-xtf-amd64-amd64-3 pass
test-xtf-amd64-amd64-4 pass
test-xtf-amd64-amd64-5 pass
test-amd64-amd64-xl pass
test-amd64-coresched-amd64-xl pass
test-arm64-arm64-xl pass
test-armhf-armhf-xl pass
test-amd64-i386-xl pass
test-amd64-coresched-i386-xl pass
test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm pass
test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm pass
test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm pass
test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm pass
test-amd64-amd64-xl-qemut-debianhvm-i386-xsm fail
test-amd64-i386-xl-qemut-debianhvm-i386-xsm fail
test-amd64-amd64-xl-qemuu-debianhvm-i386-xsm pass
test-amd64-i386-xl-qemuu-debianhvm-i386-xsm pass
test-amd64-amd64-libvirt-xsm pass
test-arm64-arm64-libvirt-xsm pass
test-amd64-i386-libvirt-xsm pass
test-amd64-amd64-xl-xsm pass
test-arm64-arm64-xl-xsm pass
test-amd64-i386-xl-xsm pass
test-amd64-amd64-qemuu-nested-amd fail
test-amd64-amd64-xl-pvhv2-amd pass
test-amd64-i386-qemut-rhel6hvm-amd pass
test-amd64-i386-qemuu-rhel6hvm-amd pass
test-amd64-amd64-dom0pvh-xl-amd pass
test-amd64-amd64-xl-qemut-debianhvm-amd64 pass
test-amd64-i386-xl-qemut-debianhvm-amd64 pass
test-amd64-amd64-xl-qemuu-debianhvm-amd64 pass
test-amd64-i386-xl-qemuu-debianhvm-amd64 pass
test-amd64-i386-freebsd10-amd64 pass
test-amd64-amd64-qemuu-freebsd11-amd64 pass
test-amd64-amd64-qemuu-freebsd12-amd64 pass
test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
test-amd64-i386-xl-qemuu-ovmf-amd64 pass
test-amd64-amd64-xl-qemut-win7-amd64 fail
test-amd64-i386-xl-qemut-win7-amd64 fail
test-amd64-amd64-xl-qemuu-win7-amd64 fail
test-amd64-i386-xl-qemuu-win7-amd64 fail
test-amd64-amd64-xl-qemut-ws16-amd64 fail
test-amd64-i386-xl-qemut-ws16-amd64 fail
test-amd64-amd64-xl-qemuu-ws16-amd64 fail
test-amd64-i386-xl-qemuu-ws16-amd64 fail
test-armhf-armhf-xl-arndale pass
test-amd64-amd64-xl-credit1 pass
test-arm64-arm64-xl-credit1 pass
test-armhf-armhf-xl-credit1 pass
test-amd64-amd64-xl-credit2 pass
test-arm64-arm64-xl-credit2 pass
test-armhf-armhf-xl-credit2 pass
test-armhf-armhf-xl-cubietruck pass
test-amd64-amd64-xl-qemuu-dmrestrict-amd64-dmrestrict pass
test-amd64-i386-xl-qemuu-dmrestrict-amd64-dmrestrict pass
test-amd64-amd64-examine pass
test-arm64-arm64-examine pass
test-armhf-armhf-examine pass
test-amd64-i386-examine pass
test-amd64-i386-freebsd10-i386 pass
test-amd64-amd64-qemuu-nested-intel pass
test-amd64-amd64-xl-pvhv2-intel pass
test-amd64-i386-qemut-rhel6hvm-intel pass
test-amd64-i386-qemuu-rhel6hvm-intel pass
test-amd64-amd64-dom0pvh-xl-intel pass
test-amd64-amd64-libvirt pass
test-armhf-armhf-libvirt pass
test-amd64-i386-libvirt pass
test-amd64-amd64-livepatch pass
test-amd64-i386-livepatch pass
test-amd64-amd64-migrupgrade pass
test-amd64-i386-migrupgrade pass
test-amd64-amd64-xl-multivcpu pass
test-armhf-armhf-xl-multivcpu pass
test-amd64-amd64-pair pass
test-amd64-i386-pair pass
test-amd64-amd64-libvirt-pair pass
test-amd64-i386-libvirt-pair pass
test-amd64-amd64-xl-pvshim pass
test-amd64-i386-xl-pvshim fail
test-amd64-amd64-pygrub pass
test-armhf-armhf-libvirt-qcow2 pass
test-amd64-amd64-xl-qcow2 pass
test-arm64-arm64-libvirt-raw fail
test-armhf-armhf-libvirt-raw pass
test-amd64-i386-libvirt-raw pass
test-amd64-amd64-xl-rtds pass
test-armhf-armhf-xl-rtds pass
test-arm64-arm64-xl-seattle pass
test-amd64-amd64-xl-qemuu-debianhvm-amd64-shadow pass
test-amd64-i386-xl-qemuu-debianhvm-amd64-shadow pass
test-amd64-amd64-xl-shadow pass
test-amd64-i386-xl-shadow pass
test-arm64-arm64-xl-thunderx pass
test-amd64-amd64-libvirt-vhd pass
test-arm64-arm64-xl-vhd pass
test-armhf-armhf-xl-vhd pass
test-amd64-i386-xl-vhd pass


------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

------------------------------------------------------------
commit 487975df53b5298316b594550c79934d646701bd
Author: Penny Zheng <penny.zheng@arm.com>
Date: Fri Sep 10 02:52:15 2021 +0000

xen/arm: introduce allocate_static_memory

This commit introduces a new function allocate_static_memory to allocate
static memory as guest RAM for domains on Static Allocation.

It uses acquire_domstatic_pages to acquire pre-configured static memory
for the domain, and uses guest_physmap_add_pages to set up the P2M table.
These pre-defined static memory banks shall be mapped to the usual guest
memory addresses (GUEST_RAM0_BASE, GUEST_RAM1_BASE) defined by
xen/include/public/arch-arm.h.

In order to deal with the trouble of count-to-order conversion when the page
count is not a power of two, this commit exports p2m_insert_mapping and
introduces a new function guest_physmap_add_pages to cope with adding a guest
RAM p2m mapping of nr_pages.

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
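
For illustration, the nr_pages-based mapping helper described above can be
pictured roughly as follows (a sketch only, not the committed code; the real
signature and error handling may differ):

    /* Sketch: map nr_pages of guest RAM without any count-to-order rounding. */
    static int guest_physmap_add_pages(struct domain *d, gfn_t gfn, mfn_t mfn,
                                       unsigned int nr_pages)
    {
        /* p2m_insert_mapping() is exported by this commit for exactly this use. */
        return p2m_insert_mapping(d, gfn, nr_pages, mfn, p2m_ram_rw);
    }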

commit c7fe462c0d274ffa30c9448c0a80affa075d789d
Author: Penny Zheng <penny.zheng@arm.com>
Date: Fri Sep 10 02:52:14 2021 +0000

xen/arm: introduce acquire_staticmem_pages and acquire_domstatic_pages

New function acquire_staticmem_pages aims to acquire nr_mfns contiguous pages
of static memory, starting at #smfn. And it is the equivalent of
alloc_heap_pages for static memory.

For each page, it shall check whether the page is reserved (PGC_reserved)
and free. It shall also perform the necessary initialization, mostly the
same as in alloc_heap_pages, such as following the same cache-coherency
policy and turning the page state into PGC_state_inuse.

New function acquire_domstatic_pages is the equivalent of alloc_domheap_pages
for static memory, and it is to acquire nr_mfns contiguous pages of
static memory and assign them to one specific domain.

It uses acquire_staticmem_pages to acquire nr_mfns pages of static memory.
Then on success, it will use assign_pages to assign those pages to one
specific domain.

In order to differentiate pages of static memory from those allocated from
the heap, this patch introduces a new page flag, PGC_reserved, and marks
pages of static memory PGC_reserved when initializing them.

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
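
A rough sketch of the flow described above (names follow the commit text; the
actual signatures, locking and error handling in the tree may differ):

    /* Sketch: acquire pre-reserved static pages, then assign them to a domain. */
    static struct page_info *acquire_domstatic_pages(struct domain *d, mfn_t smfn,
                                                     unsigned int nr_mfns,
                                                     unsigned int memflags)
    {
        struct page_info *pg = acquire_staticmem_pages(smfn, nr_mfns, memflags);

        if ( !pg )
            return NULL;

        if ( assign_pages(pg, nr_mfns, d, memflags) )
        {
            /* Hand the pages back on failure (helper introduced further down). */
            free_staticmem_pages(pg, nr_mfns, false);
            return NULL;
        }

        return pg;
    }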

commit 5260e8fb93f0e1f094de4142b2abad45844ab89c
Author: Penny Zheng <penny.zheng@arm.com>
Date: Fri Sep 10 02:52:13 2021 +0000

xen: re-define assign_pages and introduce a new function assign_page

In order to deal with the trouble of count-to-order conversion when the page
count is not a power of two, this commit re-defines assign_pages to take a
number of pages (nr) and introduces assign_page for the original case of a
single power-of-two order.

Backporting confusion could be helped by altering the order of assign_pages
parameters, such that the compiler would point out that adjustments at call
sites are needed.

[stefano: switch to unsigned int for nr]
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
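
For illustration, the relationship between the two interfaces can be sketched
as below (not the verbatim tree; as noted above, the parameter order was also
shuffled so that call sites needing adjustment are flagged by the compiler):

    /* Sketch: assign_pages() now takes a plain page count... */
    int assign_pages(struct page_info *pg, unsigned int nr, struct domain *d,
                     unsigned int memflags);

    /* ...while assign_page() keeps order-based callers working on top of it. */
    static inline int assign_page(struct page_info *pg, unsigned int order,
                                  struct domain *d, unsigned int memflags)
    {
        return assign_pages(pg, 1U << order, d, memflags);
    }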

commit 4a9e73e6e53e9d8bc005a08c3968ec36d793f140
Author: Penny Zheng <penny.zheng@arm.com>
Date: Fri Sep 10 02:52:12 2021 +0000

xen/arm: static memory initialization

This patch introduces static memory initialization, during system boot-up.

The new function init_staticmem_pages is responsible for static memory
initialization.

Helper free_staticmem_pages is the equivalent of free_heap_pages, to free
nr_mfns pages of static memory.

This commit also introduces a new CONFIG_STATIC_MEMORY option to wrap all
static-allocation-related code.

Asynchronously scrubbing pages of static memory is left on the TODO list.

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
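
The boot-time loop can be sketched roughly as follows (illustrative only; the
bank and field names follow the "xen,static-mem" parsing commit further down
and may not match the tree exactly):

    #ifdef CONFIG_STATIC_MEMORY
    /* Sketch: hand every bank reserved for a domain to free_staticmem_pages(). */
    void __init init_staticmem_pages(void)
    {
        unsigned int bank;

        for ( bank = 0; bank < bootinfo.reserved_mem.nr_banks; bank++ )
        {
            const struct membank *mb = &bootinfo.reserved_mem.bank[bank];

            if ( !mb->xen_domain )
                continue;

            free_staticmem_pages(maddr_to_page(mb->start),
                                 mb->size >> PAGE_SHIFT, false /* no scrub */);
        }
    }
    #endif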

commit 540a637c3410780b519fc055f432afe271f642f8
Author: Penny Zheng <penny.zheng@arm.com>
Date: Fri Sep 10 02:52:11 2021 +0000

xen: introduce mark_page_free

This commit defines a new helper mark_page_free to extract common code,
like following the same cache/TLB coherency policy, between free_heap_pages
and the new function free_staticmem_pages, which will be introduced later.

PDX compression means that conversion between the MFN and the page can be
potentially non-trivial. As the function is internal, pass both the MFN and
the page; they are expected to match.

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

commit 41c031ff437b66cfac4b120bd7698ca039850690
Author: Penny Zheng <penny.zheng@arm.com>
Date: Fri Sep 10 02:52:10 2021 +0000

xen/arm: introduce domain on Static Allocation

Static Allocation refers to a system or sub-systems (domains) for which
memory areas are pre-defined by configuration using physical address ranges.

This pre-defined memory -- Static Memory -- consists of parts of RAM reserved
from the beginning, which shall never go to the heap allocator or boot
allocator for any use.

Memory can be statically allocated to a domain using the property "xen,static-
mem" defined in the domain configuration. The number of cells for the address
and the size must be defined using respectively the properties
"#xen,static-mem-address-cells" and "#xen,static-mem-size-cells".

The property 'memory' is still needed and should match the amount of memory
given to the guest. Currently, it either comes from static memory or lets Xen
allocate from heap. *Mixing* is not supported.

The static memory will be mapped in the guest at the usual guest memory
addresses (GUEST_RAM0_BASE, GUEST_RAM1_BASE) defined by
xen/include/public/arch-arm.h.

This patch introduces this new `xen,static-mem` feature, and also documents
and parses this new attribute at boot time.

This patch also introduces a new field "bool xen_domain" in "struct membank"
to tell whether the memory bank is reserved as a whole hardware resource, or
bound to a Xen domain node through "xen,static-mem".

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
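
A hypothetical device tree fragment following the binding described above
might look like this (illustrative only; the addresses, sizes and the rest of
the domain node are made up):

    domU1 {
        compatible = "xen,domain";
        #address-cells = <0x1>;
        #size-cells = <0x1>;
        #xen,static-mem-address-cells = <0x1>;
        #xen,static-mem-size-cells = <0x1>;
        /* one 512MB bank of host RAM at 0x30000000, reserved for this domU */
        xen,static-mem = <0x30000000 0x20000000>;
        memory = <0x0 0x80000>; /* in KB; should match the static-mem amount */
        cpus = <0x1>;
        /* kernel/ramdisk properties omitted */
    };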

commit 904ba3ce2e46e59080dc09676cede5df63b59f20
Author: Penny Zheng <penny.zheng@arm.com>
Date: Fri Sep 10 02:52:09 2021 +0000

xen/arm: introduce new helper device_tree_get_meminfo

This commit creates a new helper device_tree_get_meminfo to iterate over a
device tree property to get memory info, like "reg".

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

commit a89bcd9737757e4d671783588a6041a84a5e1754
Author: Roger Pau Monne <roger.pau@citrix.com>
Date: Wed Jul 7 09:15:31 2021 +0200

tools/go: honor append build flags

Make the go build use APPEND_{C/LD}FLAGS when necessary, just like
other parts of the build.

Reported-by: Ting-Wei Lan <lantw44@gmail.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Acked-by: Ian Jackson <iwj@xenproject.org>

commit 6d45368a0a89e01a3a01d156af61fea565db96cc
Author: Daniel P. Smith <dpsmith@apertussolutions.com>
Date: Fri Sep 10 16:12:59 2021 -0400

xsm: drop dubious xsm_op_t type

The type xsm_op_t masks the use of void pointers. This commit drops the
xsm_op_t type and replaces it and all its uses with an explicit void.

Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

commit 2928c1d250b157fd4585ca47ba36ad4792723f1f
Author: Daniel P. Smith <dpsmith@apertussolutions.com>
Date: Fri Sep 10 16:12:58 2021 -0400

xsm: remove remnants of xsm_memtype hook

In c/s fcb8baddf00e the xsm_memtype hook was removed but some remnants were
left behind. This commit cleans up those remnants.

Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

commit 4624912c0b5505387e53a12ef3417d001431a29d
Author: Daniel P. Smith <dpsmith@apertussolutions.com>
Date: Fri Sep 10 16:12:57 2021 -0400

xsm: remove the ability to disable flask

On Linux when SELinux is put into permissive mode the discretionary access
controls are still in place. Whereas for Xen when the enforcing state of flask
is set to permissive, all operations for all domains would succeed, i.e. it
does not fall back to the default access controls. To provide a means to mimic
a similar but not equivalent behaviour, a flask op is present to allow a
one-time switch back to the default access controls, aka the "dummy policy".

While this may be desirable for an OS, Xen is a hypervisor and should not
allow the switching of which security policy framework is being enforced after
boot. This patch removes the flask op to enforce the desired XSM usage model
requiring a reboot of Xen to change the XSM policy module in use.

Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

commit f26bb285949b8c233816c4c6a87237ee14a06ebc
Author: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Fri Sep 10 16:12:56 2021 -0400

xen: Implement xen/alternative-call.h for use in common code

The alternative call infrastructure is x86-only for now, but the common iommu
code has a variant and more common code wants to use the infrastructure.

Introduce CONFIG_ALTERNATIVE_CALL and a conditional implementation so common
code can use the optimisation when available, without requiring all
architectures to implement no-op stubs.

Write some documentation, which was thus far entirely absent, covering the
requirements for an architecture to implement this optimisation, and how to
use the infrastructure in general code.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
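
The conditional wrapper can be sketched along these lines (not the verbatim
header; shown only to illustrate how common code can use the optimisation
without requiring per-architecture stubs):

    /* Sketch of xen/include/xen/alternative-call.h */
    #ifdef CONFIG_ALTERNATIVE_CALL

    #include <asm/alternative.h>    /* arch provides alternative_{v,}call() */

    #else

    /* Fall back to plain indirect calls where the feature isn't implemented. */
    #define alternative_vcall(func, args...) (func)(args)
    #define alternative_call(func, args...)  (func)(args)

    #endif
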
(qemu changes not included)
Re: [xen-unstable test] 164996: regressions - FAIL
On 16.09.2021 06:06, osstest service owner wrote:
> flight 164996 xen-unstable real [real]
> flight 165002 xen-unstable real-retest [real]
> http://logs.test-lab.xenproject.org/osstest/logs/164996/
> http://logs.test-lab.xenproject.org/osstest/logs/165002/
>
> Regressions :-(
>
> Tests which did not succeed and are blocking,
> including tests which could not be run:
> test-arm64-arm64-libvirt-raw 17 guest-start/debian.repeat fail REGR. vs. 164945

Since no-one has given a sign so far of looking into this failure, I took
a look, despite having little hope of actually figuring anything out. I'm
pretty sure the randomness of the "when" of this failure correlates
with

Sep 15 14:44:48.518439 [ 1613.227909] rpc-worker: page allocation failure: order:4, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
Sep 15 14:44:55.418534 [ 1613.240888] CPU: 48 PID: 2029 Comm: rpc-worker Not tainted 5.4.17+ #1
Sep 15 14:44:55.430511 [ 1613.247370] Hardware name: Cavium ThunderX CN88XX board (DT)
Sep 15 14:44:55.430576 [ 1613.253099] Call trace:
Sep 15 14:44:55.442497 [ 1613.255620] dump_backtrace+0x0/0x140
Sep 15 14:44:55.442558 [ 1613.259348] show_stack+0x14/0x20
Sep 15 14:44:55.442606 [ 1613.262734] dump_stack+0xbc/0x100
Sep 15 14:44:55.442651 [ 1613.266206] warn_alloc+0xf8/0x160
Sep 15 14:44:55.454512 [ 1613.269677] __alloc_pages_slowpath+0x9c4/0x9f0
Sep 15 14:44:55.454574 [ 1613.274277] __alloc_pages_nodemask+0x1cc/0x248
Sep 15 14:44:55.466498 [ 1613.278878] kmalloc_order+0x24/0xa8
Sep 15 14:44:55.466559 [ 1613.282523] __kmalloc+0x244/0x270
Sep 15 14:44:55.466607 [ 1613.285995] alloc_empty_pages.isra.17+0x34/0xb0
Sep 15 14:44:55.478495 [ 1613.290681] privcmd_ioctl_mmap_batch.isra.20+0x414/0x428
Sep 15 14:44:55.478560 [ 1613.296149] privcmd_ioctl+0xbc/0xb7c
Sep 15 14:44:55.478608 [ 1613.299883] do_vfs_ioctl+0xb8/0xae0
Sep 15 14:44:55.490475 [ 1613.303527] ksys_ioctl+0x78/0xa8
Sep 15 14:44:55.490536 [ 1613.306911] __arm64_sys_ioctl+0x1c/0x28
Sep 15 14:44:55.490584 [ 1613.310906] el0_svc_common.constprop.2+0x88/0x150
Sep 15 14:44:55.502489 [ 1613.315765] el0_svc_handler+0x20/0x80
Sep 15 14:44:55.502551 [ 1613.319583] el0_svc+0x8/0xc

As per

Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
Sep 15 14:44:55.514480 [ 1613.324918] active_file:13286 inactive_file:11182 isolated_file:0
Sep 15 14:44:55.514545 [ 1613.324918] unevictable:0 dirty:30 writeback:0 unstable:0
Sep 15 14:44:55.526477 [ 1613.324918] slab_reclaimable:10922 slab_unreclaimable:30234
Sep 15 14:44:55.526540 [ 1613.324918] mapped:11277 shmem:10975 pagetables:401 bounce:0
Sep 15 14:44:55.538474 [ 1613.324918] free:8364 free_pcp:100 free_cma:1650

the system doesn't look to really be out of memory; as per

Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB

there even look to be a number of higher order pages available (albeit
without digging I can't tell what "(C)" means). Nevertheless order-4
allocations aren't really nice.

What I can't see is why this may have started triggering recently. Was
the kernel updated in osstest? Is 512Mb of memory perhaps a bit too
small for a Dom0 on this system (with 96 CPUs)? Going through the log
I haven't been able to find crucial information like how much memory
the host has or what the hypervisor command line was.

Jan
Re: [xen-unstable test] 164996: regressions - FAIL
Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> As per
>
> Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
> Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
> Sep 15 14:44:55.514480 [ 1613.324918] active_file:13286 inactive_file:11182 isolated_file:0
> Sep 15 14:44:55.514545 [ 1613.324918] unevictable:0 dirty:30 writeback:0 unstable:0
> Sep 15 14:44:55.526477 [ 1613.324918] slab_reclaimable:10922 slab_unreclaimable:30234
> Sep 15 14:44:55.526540 [ 1613.324918] mapped:11277 shmem:10975 pagetables:401 bounce:0
> Sep 15 14:44:55.538474 [ 1613.324918] free:8364 free_pcp:100 free_cma:1650
>
> the system doesn't look to really be out of memory; as per
>
> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
>
> there even look to be a number of higher order pages available (albeit
> without digging I can't tell what "(C)" means). Nevertheless order-4
> allocations aren't really nice.

The host history suggests this may possibly be related to a qemu update.

http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html

> What I can't see is why this may have started triggering recently. Was
> the kernel updated in osstest? Is 512Mb of memory perhaps a bit too
> small for a Dom0 on this system (with 96 CPUs)? Going through the log
> I haven't been able to find crucial information like how much memory
> the host has or what the hypervisor command line was.

Logs from last host examination, including a dmesg:

http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.examine/

Re the command line, does Xen not print it ?

The bootloader output seems garbled in the serial log.

Anyway, I think Xen is being booted EFI judging by the grub cfg:

http://logs.test-lab.xenproject.org/osstest/logs/165002/test-arm64-arm64-libvirt-raw/rochester0--grub.cfg.1

which means that it is probably reading this:

http://logs.test-lab.xenproject.org/osstest/logs/165002/test-arm64-arm64-libvirt-raw/rochester0--xen.cfg

which gives this specification of the command line:

options=placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan

The grub cfg has this:

multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan ${xen_rm_opts}

It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".

Ian
Re: [xen-unstable test] 164996: regressions - FAIL
On 20.09.2021 17:44, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
>> As per
>>
>> Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
>> Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
>> Sep 15 14:44:55.514480 [ 1613.324918] active_file:13286 inactive_file:11182 isolated_file:0
>> Sep 15 14:44:55.514545 [ 1613.324918] unevictable:0 dirty:30 writeback:0 unstable:0
>> Sep 15 14:44:55.526477 [ 1613.324918] slab_reclaimable:10922 slab_unreclaimable:30234
>> Sep 15 14:44:55.526540 [ 1613.324918] mapped:11277 shmem:10975 pagetables:401 bounce:0
>> Sep 15 14:44:55.538474 [ 1613.324918] free:8364 free_pcp:100 free_cma:1650
>>
>> the system doesn't look to really be out of memory; as per
>>
>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
>>
>> there even look to be a number of higher order pages available (albeit
>> without digging I can't tell what "(C)" means). Nevertheless order-4
>> allocations aren't really nice.
>
> The host history suggests this may possibly be related to a qemu update.
>
> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
>
>> What I can't see is why this may have started triggering recently. Was
>> the kernel updated in osstest? Is 512Mb of memory perhaps a bit too
>> small for a Dom0 on this system (with 96 CPUs)? Going through the log
>> I haven't been able to find crucial information like how much memory
>> the host has or what the hypervisor command line was.
>
> Logs from last host examination, including a dmesg:
>
> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.examine/
>
> Re the command line, does Xen not print it ?
>
> The bootloader output seems garbled in the serial log.
>
> Anyway, I think Xen is being booted EFI judging by the grub cfg:
>
> http://logs.test-lab.xenproject.org/osstest/logs/165002/test-arm64-arm64-libvirt-raw/rochester0--grub.cfg.1

Also judging by output seen in the log file.

> which means that it is probaly reading this:
>
> http://logs.test-lab.xenproject.org/osstest/logs/165002/test-arm64-arm64-libvirt-raw/rochester0--xen.cfg
>
> which gives this specification of the command line:
>
> options=placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan

Funny - about half of these look to be x86-only options.

But yes, this confirms my suspicion that this Dom0 is getting limited to
512M of RAM.

> The grub cfg has this:
>
> multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan ${xen_rm_opts}
>
> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".

Which wouldn't matter - the two options are x86-only again, and hence
would (if anything) trigger log messages about unknown options. Such
log messages would be seen in the ring buffer only though, not on the
serial console (as they are issued too early).

Jan
Re: [xen-unstable test] 164996: regressions - FAIL
On Mon, 20 Sep 2021, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> > As per
> >
> > Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
> > Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
> > Sep 15 14:44:55.514480 [ 1613.324918] active_file:13286 inactive_file:11182 isolated_file:0
> > Sep 15 14:44:55.514545 [ 1613.324918] unevictable:0 dirty:30 writeback:0 unstable:0
> > Sep 15 14:44:55.526477 [ 1613.324918] slab_reclaimable:10922 slab_unreclaimable:30234
> > Sep 15 14:44:55.526540 [ 1613.324918] mapped:11277 shmem:10975 pagetables:401 bounce:0
> > Sep 15 14:44:55.538474 [ 1613.324918] free:8364 free_pcp:100 free_cma:1650
> >
> > the system doesn't look to really be out of memory; as per
> >
> > Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
> >
> > there even look to be a number of higher order pages available (albeit
> > without digging I can't tell what "(C)" means). Nevertheless order-4
> > allocations aren't really nice.
>
> The host history suggests this may possibly be related to a qemu update.
>
> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
>
> > What I can't see is why this may have started triggering recently. Was
> > the kernel updated in osstest? Is 512Mb of memory perhaps a bit too
> > small for a Dom0 on this system (with 96 CPUs)? Going through the log
> > I haven't been able to find crucial information like how much memory
> > the host has or what the hypervisor command line was.
>
> Logs from last host examination, including a dmesg:
>
> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.examine/
>
> Re the command line, does Xen not print it ?
>
> The bootloader output seems garbled in the serial log.
>
> Anyway, I think Xen is being booted EFI judging by the grub cfg:
>
> http://logs.test-lab.xenproject.org/osstest/logs/165002/test-arm64-arm64-libvirt-raw/rochester0--grub.cfg.1
>
> which means that it is probaly reading this:
>
> http://logs.test-lab.xenproject.org/osstest/logs/165002/test-arm64-arm64-libvirt-raw/rochester0--xen.cfg
>
> which gives this specification of the command line:
>
> options=placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan
>
> The grub cfg has this:
>
> multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan ${xen_rm_opts}
>
> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".

I definitely recommend increasing dom0 memory, especially as I guess
the box is going to have a significant amount, far more than 4GB. I
would set it to 2GB. Also the syntax on ARM is simpler, so it should be
just: dom0_mem=2G

In addition, I also did some investigation just in case there is
actually a bug in the code and it is not a simple OOM problem.

Looking at the recent OSSTest results, the first failure is:
https://marc.info/?l=xen-devel&m=163145323631047
http://logs.test-lab.xenproject.org/osstest/logs/164951/

Indeed, the failure is the same test-arm64-arm64-libvirt-raw which is
still failing in more recent tests:
http://logs.test-lab.xenproject.org/osstest/logs/164951/test-arm64-arm64-libvirt-raw/info.html

But if we look at the commit id of flight 164951, it is
6d45368a0a89e01a3a01d156af61fea565db96cc "xsm: drop dubious xsm_op_t
type" by Daniel P. Smith (CCed).

It is interesting because:
- it is *before* all the recent ARM patch series
- it is only 4 commits after master


The 4 commits are:

2021-09-10 16:12 Daniel P. Smith o xsm: drop dubious xsm_op_t type
2021-09-10 16:12 Daniel P. Smith o xsm: remove remnants of xsm_memtype hook
2021-09-10 16:12 Daniel P. Smith o xsm: remove the ability to disable flask
2021-09-10 16:12 Andrew Cooper o xen: Implement xen/alternative-call.h for use in common code


Looking at them in detail:

- "xen: Implement xen/alternative-call.h for use in common code"
It shouldn't affect ARM at all

- "xsm: remove the ability to disable flask"
It would only affect the test case if libvirt directly or via libxl
calls FLASK_DISABLE.

- "xsm: remove remnants of xsm_memtype hook"
Shouldn't have any effects

- "xsm: drop dubious xsm_op_t type"
It doesn't look like it should have any runtime effect, only build time


So among these four, only "xsm: remove the ability to disable flask"
seems to have the potential to break a libvirt guest start test. Even
then it is far-fetched, and the lack of an explicit XSM-related error
message in the logs really points in the direction of an OOM.
Re: [xen-unstable test] 164996: regressions - FAIL
On 22.09.2021 01:38, Stefano Stabellini wrote:
> On Mon, 20 Sep 2021, Ian Jackson wrote:
>> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
>>> As per
>>>
>>> Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
>>> Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
>>> Sep 15 14:44:55.514480 [ 1613.324918] active_file:13286 inactive_file:11182 isolated_file:0
>>> Sep 15 14:44:55.514545 [ 1613.324918] unevictable:0 dirty:30 writeback:0 unstable:0
>>> Sep 15 14:44:55.526477 [ 1613.324918] slab_reclaimable:10922 slab_unreclaimable:30234
>>> Sep 15 14:44:55.526540 [ 1613.324918] mapped:11277 shmem:10975 pagetables:401 bounce:0
>>> Sep 15 14:44:55.538474 [ 1613.324918] free:8364 free_pcp:100 free_cma:1650
>>>
>>> the system doesn't look to really be out of memory; as per
>>>
>>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
>>>
>>> there even look to be a number of higher order pages available (albeit
>>> without digging I can't tell what "(C)" means). Nevertheless order-4
>>> allocations aren't really nice.
>>
>> The host history suggests this may possibly be related to a qemu update.
>>
>> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html

Stefano - as per some of your investigation detailed further down I
wonder whether you had seen this part of Ian's reply. (Question of
course then is how that qemu update had managed to get pushed.)

>> The grub cfg has this:
>>
>> multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan ${xen_rm_opts}
>>
>> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".
>
> I definitely recommend to increase dom0 memory, especially as I guess
> the box is going to have a significant amount, far more than 4GB. I
> would set it to 2GB. Also the syntax on ARM is simpler, so it should be
> just: dom0_mem=2G

Ian - I guess that's an adjustment relatively easy to make? I wonder
though whether we wouldn't want to address the underlying issue first.
Presumably not, because the fix would likely take quite some time to
propagate suitably. Yet if not, we will want to have some way of
verifying that an eventual fix there would have helped here.

> In addition, I also did some investigation just in case there is
> actually a bug in the code and it is not a simple OOM problem.

I think the actual issue is quite clear; what I'm struggling with is
why we weren't hit by it earlier.

As imo always, non-order-0 allocations (perhaps excluding the bringing
up of the kernel or whichever entity) are to be avoided if at all possible.
The offender in this case looks to be privcmd's alloc_empty_pages().
For it to request through kcalloc() what ends up being an order-4
allocation, the original IOCTL_PRIVCMD_MMAPBATCH must specify a pretty
large chunk of guest memory to get mapped. Which may in turn be
questionable, but I'm afraid I don't have the time to try to drill
down where that request is coming from and whether that also wouldn't
better be split up.

The solution looks simple enough - convert from kcalloc() to kvcalloc().
I can certainly spin up a patch to Linux to this effect. Yet that still
won't answer the question of why this issue has popped up all of a
sudden (and hence whether there are things wanting changing elsewhere
as well).
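
For illustration, the conversion would look roughly like this (a sketch
against a 5.4-ish privcmd, not the actual patch; the matching kfree(pages)
in privcmd_close() would presumably become kvfree() as well):

    /* Sketch only: drivers/xen/privcmd.c, kcalloc() -> kvcalloc() conversion */
    static int alloc_empty_pages(struct vm_area_struct *vma, int numpgs)
    {
        struct page **pages;

        /* was: kcalloc(numpgs, sizeof(pages[0]), GFP_KERNEL) -- an order-4
         * slab allocation once numpgs reaches a few thousand entries */
        pages = kvcalloc(numpgs, sizeof(pages[0]), GFP_KERNEL);
        if (pages == NULL)
            return -ENOMEM;

        /* remainder of the function unchanged */
        vma->vm_private_data = pages;
        return 0;
    }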

Jan
Re: [xen-unstable test] 164996: regressions - FAIL
Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> On 22.09.2021 01:38, Stefano Stabellini wrote:
> > On Mon, 20 Sep 2021, Ian Jackson wrote:
> >>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
> >>>
> >>> there even look to be a number of higher order pages available (albeit
> >>> without digging I can't tell what "(C)" means). Nevertheless order-4
> >>> allocations aren't really nice.
> >>
> >> The host history suggests this may possibly be related to a qemu update.
> >>
> >> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
>
> Stefano - as per some of your investigation detailed further down I
> wonder whether you had seen this part of Ian's reply. (Question of
> course then is how that qemu update had managed to get pushed.)

I looked for bisection results for this failure and

http://logs.test-lab.xenproject.org/osstest/results/bisect/xen-unstable/test-arm64-arm64-libvirt-xsm.guest-start--debian.repeat.html

it's a heisenbug. Also, the tests got reorganised slightly as a
side-effect of dropping some i386 tests, so some of these tests are
"new" from osstest's pov, although their content isn't really new.

Unfortunately, with it being a heisenbug, we won't get any useful
bisection results, which would otherwise conclusively tell us which
tree the problem was in.

> >> The grub cfg has this:
> >>
> >> multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan ${xen_rm_opts}
> >>
> >> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".
> >
> > I definitely recommend to increase dom0 memory, especially as I guess
> > the box is going to have a significant amount, far more than 4GB. I
> > would set it to 2GB. Also the syntax on ARM is simpler, so it should be
> > just: dom0_mem=2G
>
> Ian - I guess that's an adjustment relatively easy to make? I wonder
> though whether we wouldn't want to address the underlying issue first.
> Presumably not, because the fix would likely take quite some time to
> propagate suitably. Yet if not, we will want to have some way of
> verifying that an eventual fix there would have helped here.

It could propagate fairly quickly. But I'm loath to make this change
because it seems to me that it would be simply masking the bug.

Notably, when this goes wrong, it seems to happen after the guest has
been started once successfully already. So there *is* enough
memory...

Ian.
Re: [xen-unstable test] 164996: regressions - FAIL
On 22.09.2021 13:20, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
>> On 22.09.2021 01:38, Stefano Stabellini wrote:
>>> On Mon, 20 Sep 2021, Ian Jackson wrote:
>>>>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
>>>>>
>>>>> there even look to be a number of higher order pages available (albeit
>>>>> without digging I can't tell what "(C)" means). Nevertheless order-4
>>>>> allocations aren't really nice.
>>>>
>>>> The host history suggests this may possibly be related to a qemu update.
>>>>
>>>> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
>>
>> Stefano - as per some of your investigation detailed further down I
>> wonder whether you had seen this part of Ian's reply. (Question of
>> course then is how that qemu update had managed to get pushed.)
>
> I looked for bisection results for this failure and
>
> http://logs.test-lab.xenproject.org/osstest/results/bisect/xen-unstable/test-arm64-arm64-libvirt-xsm.guest-start--debian.repeat.html
>
> it's a heisenbug. Also, the tests got reorganised slightly as a
> side-effect of dropping some i386 tests, so some of these tests are
> "new" from osstest's pov, although their content isn't really new.
>
> Unfortunately, with it being a heisenbug, we won't get any useful
> bisection results, which would otherwise conclusively tell us which
> tree the problem was in.

Quite unfortunate.

>>>> The grub cfg has this:
>>>>
>>>> multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan ${xen_rm_opts}
>>>>
>>>> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".
>>>
>>> I definitely recommend to increase dom0 memory, especially as I guess
>>> the box is going to have a significant amount, far more than 4GB. I
>>> would set it to 2GB. Also the syntax on ARM is simpler, so it should be
>>> just: dom0_mem=2G
>>
>> Ian - I guess that's an adjustment relatively easy to make? I wonder
>> though whether we wouldn't want to address the underlying issue first.
>> Presumably not, because the fix would likely take quite some time to
>> propagate suitably. Yet if not, we will want to have some way of
>> verifying that an eventual fix there would have helped here.
>
> It could propagate fairly quickly.

Is the Dom0 kernel used here a distro one or our own build of one of
the upstream trees? In the latter case I'd expect propagation to be
quite a bit faster than in the former case.

> But I'm loathe to make this change
> because it seems to me that it would be simply masking the bug.
>
> Notably, when this goes wrong, it seems to happen after the guest has
> been started once successfully already. So there *is* enough
> memory...

Well, there is enough memory, sure, but (transiently as it seems) not
enough contiguous chunks. The likelihood of higher order allocations
failing increases with smaller overall memory amounts (in Dom0 in this
case), afaict, unless there's (aggressive) de-fragmentation.

Jan
Re: [xen-unstable test] 164996: regressions - FAIL
Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> Is the Dom0 kernel used here a distro one or our own build of one of
> the upstream trees? In the latter case I'd expect propagation to be
> quite a bit faster than in the former case.

It's our own build.

> > But I'm loathe to make this change
> > because it seems to me that it would be simply masking the bug.
> >
> > Notably, when this goes wrong, it seems to happen after the guest has
> > been started once successfully already. So there *is* enough
> > memory...
>
> Well, there is enough memory, sure, but (transiently as it seems) not
> enough contiguous chunks. The likelihood of higher order allocations
> failing increases with smaller overall memory amounts (in Dom0 in this
> case), afaict, unless there's (aggressive) de-fragmentation.

Indeed.

I'm not sure, though, that I fully understand the design principles
behind non-order-0 allocations, and memory sizing, and so on. Your
earlier mail suggested there may not be a design principle, and that
anything relying on non-order-0 atomic allocations is only working by
luck (or an embarrassing excess of ram).

Ian.
Re: [xen-unstable test] 164996: regressions - FAIL
On 22.09.2021 14:29, Ian Jackson wrote:
> I'm not sure, though, that I fully understand the design principles
> behind non-order-0 allocations, and memory sizing, and so on. Your
> earlier mail suggeted there may not be a design principle, and that
> anything relying on non-order-0 atomic allocations is only working by
> luck (or an embarassing excess of ram).

That's what I think, yes. During boot and in certain other specific
places it may be okay to use such allocations, as long as failure
leads to something non-destructive. A process (or VM) not getting
created successfully _might_ be okay; a process or VM failing when
it already runs is not okay. Just to give an example. The situation
here falls in the latter category, at least from osstest's pov. IOW
assuming that what gets tested is a goal in terms of functionality,
VM creation failing when there is enough memory (just not in the
right "shape") is not okay here. Or else the test was wrongly put
in place.

Therefore a goal I've been trying to follow in the hypervisor is to
eliminate higher order allocations wherever possible. And I think
the kernel wants to follow suit here.

Jan
Re: [xen-unstable test] 164996: regressions - FAIL
On Wed, 22 Sep 2021, Jan Beulich wrote:
> On 22.09.2021 01:38, Stefano Stabellini wrote:
> > On Mon, 20 Sep 2021, Ian Jackson wrote:
> >> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> >>> As per
> >>>
> >>> Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
> >>> Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
> >>> Sep 15 14:44:55.514480 [ 1613.324918] active_file:13286 inactive_file:11182 isolated_file:0
> >>> Sep 15 14:44:55.514545 [ 1613.324918] unevictable:0 dirty:30 writeback:0 unstable:0
> >>> Sep 15 14:44:55.526477 [ 1613.324918] slab_reclaimable:10922 slab_unreclaimable:30234
> >>> Sep 15 14:44:55.526540 [ 1613.324918] mapped:11277 shmem:10975 pagetables:401 bounce:0
> >>> Sep 15 14:44:55.538474 [ 1613.324918] free:8364 free_pcp:100 free_cma:1650
> >>>
> >>> the system doesn't look to really be out of memory; as per
> >>>
> >>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
> >>>
> >>> there even look to be a number of higher order pages available (albeit
> >>> without digging I can't tell what "(C)" means). Nevertheless order-4
> >>> allocations aren't really nice.
> >>
> >> The host history suggests this may possibly be related to a qemu update.
> >>
> >> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
>
> Stefano - as per some of your investigation detailed further down I
> wonder whether you had seen this part of Ian's reply. (Question of
> course then is how that qemu update had managed to get pushed.)
>
> >> The grub cfg has this:
> >>
> >> multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan ${xen_rm_opts}
> >>
> >> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".
> >
> > I definitely recommend to increase dom0 memory, especially as I guess
> > the box is going to have a significant amount, far more than 4GB. I
> > would set it to 2GB. Also the syntax on ARM is simpler, so it should be
> > just: dom0_mem=2G
>
> Ian - I guess that's an adjustment relatively easy to make? I wonder
> though whether we wouldn't want to address the underlying issue first.
> Presumably not, because the fix would likely take quite some time to
> propagate suitably. Yet if not, we will want to have some way of
> verifying that an eventual fix there would have helped here.
>
> > In addition, I also did some investigation just in case there is
> > actually a bug in the code and it is not a simple OOM problem.
>
> I think the actual issue is quite clear; what I'm struggling with is
> why we weren't hit by it earlier.
>
> As imo always, non-order-0 allocations (perhaps excluding the bringing
> up of the kernel or whichever entity) are to be avoided it at possible.
> The offender in this case looks to be privcmd's alloc_empty_pages().
> For it to request through kcalloc() what ends up being an order-4
> allocation, the original IOCTL_PRIVCMD_MMAPBATCH must specify a pretty
> large chunk of guest memory to get mapped. Which may in turn be
> questionable, but I'm afraid I don't have the time to try to drill
> down where that request is coming from and whether that also wouldn't
> better be split up.
>
> The solution looks simple enough - convert from kcalloc() to kvcalloc().
> I can certainly spin up a patch to Linux to this effect. Yet that still
> won't answer the question of why this issue has popped up all of the
> sudden (and hence whether there are things wanting changing elsewhere
> as well).

Also, I saw your patches for Linux. Let's say that the patches are
reviewed and enqueued immediately to be sent to Linus at the next
opportunity. It is going to take a while for them to take effect in
OSSTest, unless we import them somehow in the Linux tree used by OSSTest
straight away, right?

Should we arrange for one test OSSTest flight now with the patches
applied to see if they actually fix the issue? Otherwise we might end up
waiting for nothing...
Re: [xen-unstable test] 164996: regressions - FAIL
Hi,

Sorry for the formatting.


On Thu, 23 Sep 2021, 06:10 Stefano Stabellini, <sstabellini@kernel.org>
wrote:

> On Wed, 22 Sep 2021, Jan Beulich wrote:
> > On 22.09.2021 01:38, Stefano Stabellini wrote:
> > > On Mon, 20 Sep 2021, Ian Jackson wrote:
> > >> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions -
> FAIL"):
> > >>> As per
> > >>>
> > >>> Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
> > >>> Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639
> inactive_anon:15857 isolated_anon:0
> > >>> Sep 15 14:44:55.514480 [ 1613.324918] active_file:13286
> inactive_file:11182 isolated_file:0
> > >>> Sep 15 14:44:55.514545 [ 1613.324918] unevictable:0 dirty:30
> writeback:0 unstable:0
> > >>> Sep 15 14:44:55.526477 [ 1613.324918] slab_reclaimable:10922
> slab_unreclaimable:30234
> > >>> Sep 15 14:44:55.526540 [ 1613.324918] mapped:11277 shmem:10975
> pagetables:401 bounce:0
> > >>> Sep 15 14:44:55.538474 [ 1613.324918] free:8364 free_pcp:100
> free_cma:1650
> > >>>
> > >>> the system doesn't look to really be out of memory; as per
> > >>>
> > >>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB
> (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C)
> 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
> > >>>
> > >>> there even look to be a number of higher order pages available
> (albeit
> > >>> without digging I can't tell what "(C)" means). Nevertheless order-4
> > >>> allocations aren't really nice.
> > >>
> > >> The host history suggests this may possibly be related to a qemu
> update.
> > >>
> > >>
> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
> >
> > Stefano - as per some of your investigation detailed further down I
> > wonder whether you had seen this part of Ian's reply. (Question of
> > course then is how that qemu update had managed to get pushed.)
> >
> > >> The grub cfg has this:
> > >>
> > >> multiboot /xen placeholder conswitch=x watchdog noreboot
> async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan
> ${xen_rm_opts}
> > >>
> > >> It's not clear to me whether xen_rm_opts is "" or "no-real-mode
> edd=off".
> > >
> > > I definitely recommend to increase dom0 memory, especially as I guess
> > > the box is going to have a significant amount, far more than 4GB. I
> > > would set it to 2GB. Also the syntax on ARM is simpler, so it should be
> > > just: dom0_mem=2G
> >
> > Ian - I guess that's an adjustment relatively easy to make? I wonder
> > though whether we wouldn't want to address the underlying issue first.
> > Presumably not, because the fix would likely take quite some time to
> > propagate suitably. Yet if not, we will want to have some way of
> > verifying that an eventual fix there would have helped here.
> >
> > > In addition, I also did some investigation just in case there is
> > > actually a bug in the code and it is not a simple OOM problem.
> >
> > I think the actual issue is quite clear; what I'm struggling with is
> > why we weren't hit by it earlier.
> >
> > As imo always, non-order-0 allocations (perhaps excluding the bringing
> > up of the kernel or whichever entity) are to be avoided it at possible.
> > The offender in this case looks to be privcmd's alloc_empty_pages().
> > For it to request through kcalloc() what ends up being an order-4
> > allocation, the original IOCTL_PRIVCMD_MMAPBATCH must specify a pretty
> > large chunk of guest memory to get mapped. Which may in turn be
> > questionable, but I'm afraid I don't have the time to try to drill
> > down where that request is coming from and whether that also wouldn't
> > better be split up.
> >
> > The solution looks simple enough - convert from kcalloc() to kvcalloc().
> > I can certainly spin up a patch to Linux to this effect. Yet that still
> > won't answer the question of why this issue has popped up all of the
> > sudden (and hence whether there are things wanting changing elsewhere
> > as well).
>
> Also, I saw your patches for Linux. Let's say that the patches are
> reviewed and enqueued immediately to be sent to Linus at the next
> opportunity. It is going to take a while for them to take effect in
> OSSTest, unless we import them somehow in the Linux tree used by OSSTest
> straight away, right?
>

For Arm testing we don't use a branch provided by Linux upstream. So your
wait will be forever :).


> Should we arrange for one test OSSTest flight now with the patches
> applied to see if they actually fix the issue? Otherwise we might end up
> waiting for nothing...


We could push the patch into the branch we have. However, the Linux we use
is now fairly old (I think I did a push last year) and not even the latest
stable.

I can't remember whether we still have some patches on top of Linux to run
on arm (specifically 32-bit). So maybe we should start to track upstream
instead?

This would have the benefit of picking up any new patches.

Cheers,

Re: [xen-unstable test] 164996: regressions - FAIL
On 23.09.2021 03:10, Stefano Stabellini wrote:
> On Wed, 22 Sep 2021, Jan Beulich wrote:
>> On 22.09.2021 01:38, Stefano Stabellini wrote:
>>> On Mon, 20 Sep 2021, Ian Jackson wrote:
>>>> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
>>>>> As per
>>>>>
>>>>> Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
>>>>> Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
>>>>> Sep 15 14:44:55.514480 [ 1613.324918] active_file:13286 inactive_file:11182 isolated_file:0
>>>>> Sep 15 14:44:55.514545 [ 1613.324918] unevictable:0 dirty:30 writeback:0 unstable:0
>>>>> Sep 15 14:44:55.526477 [ 1613.324918] slab_reclaimable:10922 slab_unreclaimable:30234
>>>>> Sep 15 14:44:55.526540 [ 1613.324918] mapped:11277 shmem:10975 pagetables:401 bounce:0
>>>>> Sep 15 14:44:55.538474 [ 1613.324918] free:8364 free_pcp:100 free_cma:1650
>>>>>
>>>>> the system doesn't look to really be out of memory; as per
>>>>>
>>>>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
>>>>>
>>>>> there even look to be a number of higher order pages available (albeit
>>>>> without digging I can't tell what "(C)" means). Nevertheless order-4
>>>>> allocations aren't really nice.
>>>>
>>>> The host history suggests this may possibly be related to a qemu update.
>>>>
>>>> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
>>
>> Stefano - as per some of your investigation detailed further down I
>> wonder whether you had seen this part of Ian's reply. (Question of
>> course then is how that qemu update had managed to get pushed.)
>>
>>>> The grub cfg has this:
>>>>
>>>> multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan ${xen_rm_opts}
>>>>
>>>> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".
>>>
>>> I definitely recommend increasing dom0 memory, especially as I guess
>>> the box is going to have a significant amount, far more than 4GB. I
>>> would set it to 2GB. Also the syntax on ARM is simpler, so it should be
>>> just: dom0_mem=2G
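(As a sketch only - assuming nothing else on that line needs to change for
this host - the multiboot line quoted above would then become:

multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=2G ucode=scan ${xen_rm_opts}

i.e. the x86-style ",max:512M" suffix is simply dropped, per the simpler
Arm syntax.)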
>>
>> Ian - I guess that's an adjustment relatively easy to make? I wonder
>> though whether we wouldn't want to address the underlying issue first.
>> Presumably not, because the fix would likely take quite some time to
>> propagate suitably. Yet if not, we will want to have some way of
>> verifying that an eventual fix there would have helped here.
>>
>>> In addition, I also did some investigation just in case there is
>>> actually a bug in the code and it is not a simple OOM problem.
>>
>> I think the actual issue is quite clear; what I'm struggling with is
>> why we weren't hit by it earlier.
>>
>> Imo, as always, non-order-0 allocations (perhaps excluding those needed
>> while bringing up the kernel or whichever entity) are to be avoided if
>> at all possible.
>> The offender in this case looks to be privcmd's alloc_empty_pages().
>> For it to request through kcalloc() what ends up being an order-4
>> allocation, the original IOCTL_PRIVCMD_MMAPBATCH must specify a pretty
>> large chunk of guest memory to get mapped. Which may in turn be
>> questionable, but I'm afraid I don't have the time to try to drill
>> down where that request is coming from and whether that also wouldn't
>> better be split up.
>>
>> The solution looks simple enough - convert from kcalloc() to kvcalloc().
>> I can certainly spin up a patch to Linux to this effect. Yet that still
>> won't answer the question of why this issue has popped up all of a
>> sudden (and hence whether there are things wanting changing elsewhere
>> as well).
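
To make that conversion concrete, here is a rough sketch (not the literal
patch): the exact body of alloc_empty_pages() in drivers/xen/privcmd.c
varies by kernel version, the balloon call below matches a 5.4-era tree,
and the pr_warn()/BUG_ON() error handling is trimmed for brevity.

/* Assumes the usual privcmd includes: <linux/mm.h>, <linux/slab.h>,
 * <xen/balloon.h>. */
static int alloc_empty_pages(struct vm_area_struct *vma, int numpgs)
{
        struct page **pages;

        /*
         * Before: kcalloc() needs physically contiguous memory, so a large
         * numpgs turns into a high-order (here order-4) page allocation:
         *
         *      pages = kcalloc(numpgs, sizeof(pages[0]), GFP_KERNEL);
         *
         * After: kvcalloc() falls back to vmalloc() when contiguous memory
         * is not available, so no high-order allocation is required.
         */
        pages = kvcalloc(numpgs, sizeof(pages[0]), GFP_KERNEL);
        if (pages == NULL)
                return -ENOMEM;

        if (alloc_xenballooned_pages(numpgs, pages) != 0) {
                /* Any kfree() of this buffer has to become kvfree(). */
                kvfree(pages);
                return -ENOMEM;
        }

        vma->vm_private_data = pages;
        return 0;
}

The matching free of this buffer when the vma is torn down needs the same
kfree() -> kvfree() change.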
>
> Also, I saw your patches for Linux. Let's say that the patches are
> reviewed and enqueued immediately to be sent to Linus at the next
> opportunity. It is going to take a while for them to take effect in
> OSSTest, unless we import them somehow in the Linux tree used by OSSTest
> straight away, right?

Yes.

> Should we arrange for one test OSSTest flight now with the patches
> applied to see if they actually fix the issue? Otherwise we might end up
> waiting for nothing...

Not sure how easy it is to do one-off Linux builds that are then used in
hypervisor tests. Ian?

Jan
Re: [xen-unstable test] 164996: regressions - FAIL [ In reply to ]
On 23.09.2021 04:56, Julien Grall wrote:
> We could push the patch to the branch we have. However, the Linux we use is
> now fairly old (I think I did a push last year) and not even the latest
> stable.

I don't think that's a problem here - this looks to be 5.4.17-ish, which
the patch should be good for (and it does apply cleanly to plain 5.4.0).

Ian, for your setting up of a one-off flight (as just talked about),
you can find the patch at
https://lists.xen.org/archives/html/xen-devel/2021-09/msg01691.html
(and perhaps in your mailbox). In case that's easier I've also attached
it here.

Jan
Re: [xen-unstable test] 164996: regressions - FAIL [ In reply to ]
Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> Ian, for your setting up of a one-off flight (as just talked about),
> you can find the patch at
> https://lists.xen.org/archives/html/xen-devel/2021-09/msg01691.html
> (and perhaps in your mailbox). In case that's easier I've also attached
> it here.
...
> [DELETED ATTACHMENT linux-5.15-rc2-xen-privcmd-mmap-kvcalloc.patch, plain text]

Thanks. The attachment didn't git-am but I managed to make a tree
with it in (but a bogus commit message).

I have a repro of 165218 test-arm64-arm64-libvirt-raw (that's the last
xen-unstable flight) running. If all goes well it will rebuild Linux
from my branch (new flight 165241) and then run the test using that
kernel (new flight 165242). I have told it to report to the people on
this thread (and the list).

It will probably report in an hour or two (since it needs to rebuild a
kernel and then negotiate to get a host to run the repro on).
I didn't ask it to keep the host for me, but it ought to publish the
logs and as I say, send an email report here.

Ian.

For my reference:

./mg-transient-task ./mg-repro-setup -P -E...,iwj@xenproject.org,... 165218 test-arm64-arm64-libvirt-raw X --rebuild +linux=https://xenbits.xen.org/git-http/people/iwj/linux.git#164996-fix alloc:equiv-rochester
Re: [xen-unstable test] 164996: regressions - FAIL [ In reply to ]
Ian Jackson writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> Thanks. The attachment didn't git-am but I managed to make a tree
> with it in (but a bogus commit message).
>
> I have a repro of 165218 test-arm64-arm64-libvirt-raw (that's the last
> xen-unstable flight) running. If all goes well it will rebuild Linux
> from my branch (new flight 165241) and then run the test using that
> kernel (new flight 165242). I have told it to report to the people on
> this thread (and the list).
>
> It will probably report in an hour or two (since it needs to rebuild a
> kernel and then negotiate to get a host to run the repro on).
> I didn't ask it to keep the host for me, but it ought to publish the
> logs and as I say, send an email report here.

Restarted as 165323 and 165324. Maybe the thing won't catch fire this
time. Unusual consequences for a small kernel patch :-).

Ian.
Re: [xen-unstable test] 164996: regressions - FAIL [ In reply to ]
Ian Jackson writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> Thanks. The attachment didn't git-am but I managed to make a tree
> with it in (but a bogus commit message).
>
> I have a repro of 165218 test-arm64-arm64-libvirt-raw (that's the last
> xen-unstable flight) running. If all goes well it will rebuild Linux
> from my branch (new flight 165241) and then run the test using that
> kernel (new flight 165242). I have told it to report to the people on
> this thread (and the list).
>
> It will probably report in an hour or two (since it needs to rebuild a
> kernel and then negotiate to get a host to run the repro on).
> I didn't ask it to keep the host for me, but it ought to publish the
> logs and as I say, send an email report here.

This was disrupted by the osstest failure. I'm running it again.
165354 and 165355.

Ian.

For my reference:

./mg-transient-task ./mg-repro-setup -P -Exen-devel@lists.xenproject.org,jbeulich@suse.com,julien.grall.oss@gmail.com,iwj@xenproject.org,sstabellini@kernel.org,dpsmith@apertussolutions.com 165218 test-arm64-arm64-libvirt-raw X --rebuild +linux=https://xenbits.xen.org/git-http/people/iwj/linux.git#164996-fix alloc:'{equiv-rochester,real}