Mailing List Archive

xen master: xl create hangs
Hi,

I'm testing the latest xen master branch (d8cb7e0f069e0f106d24941355b59b45a731eabe)
and when I create a domain with xl create, the domain is successfuly created, but the command hangs and never returns

After regaining control by sending a CTRL-C, and destroying the domain, I'm left with a
(null) domain for some reason.

I wanted to confirm this issue, so I installed xen-hypervisor-4.16-amd64 package (ubuntu 22.04), and I couldn't reproduce the issue there, xl create returns as expected.

This is how I compiled and configured Xen:

./configure --enable-systemd
make -j $(nproc) debball
sudo dpkg -i dist/xen-upstream-4.17-unstable.deb
sudo systemctl enable xen-qemu-dom0-disk-backend.service
sudo systemctl enable xen-init-dom0.service
sudo systemctl enable xenconsoled.service

This is the configuration I'm using to create my domain:
https://gist.github.com/Wenzel/ae16c83618dc375376cc0a18030ee777

Any ideas what's happening ?

Thanks !

Mathieu Tarral
Re: xen master: xl create hangs [ In reply to ]
Using gdb to debug the xl process, I get the following stacktrace:

(gdb) bt
#0 __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=8652, futex_word=0x7f6debd22a50) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=8652, futex_word=0x7f6debd22a50) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f6debd22a50, expected=8652, clockid=clockid@entry=0, abstime=abstime@entry=0x0,
private=private@entry=128) at ./nptl/futex-internal.c:139
#3 0x00007f6deba736a4 in __pthread_clockjoin_ex (threadid=140110084581248, thread_return=thread_return@entry=0x0, clockid=clockid@entry=0,
abstime=abstime@entry=0x0, block=block@entry=true) at ./nptl/pthread_join_common.c:105
#4 0x00007f6deba73543 in ___pthread_join (threadid=<optimized out>, thread_return=thread_return@entry=0x0) at ./nptl/pthread_join.c:24
#5 0x00007f6deb9a144b in xs_daemon_close (h=0x561db3bc5bc0) at xs.c:366
#6 0x00007f6deb9a145f in xs_close (xsh=<optimized out>) at xs.c:386
#7 0x00007f6debc43a36 in libxl_ctx_free (ctx=0x561db3bc52e0) at libxl.c:173
#8 0x0000561db33bf5a3 in xl_ctx_free () at xl.c:370
#9 0x00007f6deba22495 in __run_exit_handlers (status=0, listp=0x7f6debbf6838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true,
run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#10 0x00007f6deba22610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#11 0x00007f6deba06d97 in __libc_start_call_main (main=main@entry=0x561db33c0425 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffeb2f263d8)
at ../sysdeps/nptl/libc_start_call_main.h:74
#12 0x00007f6deba06e40 in __libc_start_main_impl (main=0x561db33c0425 <main>, argc=4, argv=0x7ffeb2f263d8, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7ffeb2f263c8) at ../csu/libc-start.c:392
#13 0x0000561db33bf425 in _start ()

Colorized version in a Github Gist:
https://gist.github.com/Wenzel/4da1e0a025954fac13a0ee57147cc44f

So looks like xs_daemon_close is waiting on a thread to join:
https://github.com/xen-project/xen/blob/a5fb66f4513c2c2d222dcc3753163b15690bd003/tools/libs/store/xs.c#L366

"A read thread which pulls messages off the comms channel and signals waiters."

Any ideas what could go wrong ?
How to debug this further ?

Thanks,

Mathieu
Re: xen master: xl create hangs [ In reply to ]
On Tue, Jul 19, 2022 at 4:26 AM Mathieu Tarral <
mathieu.tarral@protonmail.com> wrote:

> Using gdb to debug the xl process, I get the following stacktrace:
>
> (gdb) bt
> #0 __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0,
> op=265, expected=8652, futex_word=0x7f6debd22a50) at
> ./nptl/futex-internal.c:57
> #1 __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0,
> clockid=0, expected=8652, futex_word=0x7f6debd22a50) at
> ./nptl/futex-internal.c:87
> #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f6debd22a50,
> expected=8652, clockid=clockid@entry=0, abstime=abstime@entry=0x0,
> private=private@entry=128) at ./nptl/futex-internal.c:139
> #3 0x00007f6deba736a4 in __pthread_clockjoin_ex
> (threadid=140110084581248, thread_return=thread_return@entry=0x0,
> clockid=clockid@entry=0,
> abstime=abstime@entry=0x0, block=block@entry=true) at
> ./nptl/pthread_join_common.c:105
> #4 0x00007f6deba73543 in ___pthread_join (threadid=<optimized out>,
> thread_return=thread_return@entry=0x0) at ./nptl/pthread_join.c:24
> #5 0x00007f6deb9a144b in xs_daemon_close (h=0x561db3bc5bc0) at xs.c:366
> #6 0x00007f6deb9a145f in xs_close (xsh=<optimized out>) at xs.c:386
> #7 0x00007f6debc43a36 in libxl_ctx_free (ctx=0x561db3bc52e0) at
> libxl.c:173
> #8 0x0000561db33bf5a3 in xl_ctx_free () at xl.c:370
> #9 0x00007f6deba22495 in __run_exit_handlers (status=0,
> listp=0x7f6debbf6838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry
> =true,
> run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
> #10 0x00007f6deba22610 in __GI_exit (status=<optimized out>) at
> ./stdlib/exit.c:143
> #11 0x00007f6deba06d97 in __libc_start_call_main (main=main@entry=0x561db33c0425
> <main>, argc=argc@entry=4, argv=argv@entry=0x7ffeb2f263d8)
> at ../sysdeps/nptl/libc_start_call_main.h:74
> #12 0x00007f6deba06e40 in __libc_start_main_impl (main=0x561db33c0425
> <main>, argc=4, argv=0x7ffeb2f263d8, init=<optimized out>, fini=<optimized
> out>,
> rtld_fini=<optimized out>, stack_end=0x7ffeb2f263c8) at
> ../csu/libc-start.c:392
> #13 0x0000561db33bf425 in _start ()
>
> Colorized version in a Github Gist:
> https://gist.github.com/Wenzel/4da1e0a025954fac13a0ee57147cc44f
>
> So looks like xs_daemon_close is waiting on a thread to join:
>
> https://github.com/xen-project/xen/blob/a5fb66f4513c2c2d222dcc3753163b15690bd003/tools/libs/store/xs.c#L366
>
> "A read thread which pulls messages off the comms channel and signals
> waiters."
>
> Any ideas what could go wrong ?
> How to debug this further ?
>

Unfortunately, the way the async stuff in libxl is implemented, the stack
trace won't be particularly useful. Can you run it with the `-vvvvvvvvvv`
command line option? (I don't know exactly how many v's; just put in a lot
and you'll get all the debug info.)

The other thing to include would be the output of `xl dmesg`, and the
information in /var/log/xen/ pertaining to that domain.

Another way to track this down would be to do a 'git bisect', between
`master` and `RELEASE-4.16.0` to see where things stopped working; but
unless you feel particularly motivated, it's probably something to try
after we've looked at the other information to see if we can figure out
what's gone wrong.

-George
Re: xen master: xl create hangs [ In reply to ]
Hi George

> Unfortunately, the way the async stuff in libxl is implemented, the stack trace won't be particularly useful. Can you run it with the `-vvvvvvvvvv` command line option? (I don't know exactly how many v's; just put in a lot and you'll get all the debug info.)
>
> The other thing to include would be the output of `xl dmesg`, and the information in /var/log/xen/ pertaining to that domain.
>
> Another way to track this down would be to do a 'git bisect', between `master` and `RELEASE-4.16.0` to see where things stopped working; but unless you feel particularly motivated, it's probably something to try after we've looked at the other information to see if we can figure out what's gone wrong.

I've added the following files on this Gist:
https://gist.github.com/Wenzel/4da1e0a025954fac13a0ee57147cc44f

- xl_dmesg.log: xl dmesg
- qemu-dm-ubuntu.log: /var/log/xen/qemu-dm-ubuntu.log
- xl-ubuntu.log: /var/log/xen/xl-ubuntu.log
- xl_create_hvm_cfg.log: xl create -vvvvvvvv hvm.cfg

I hope this helps !!

Thank you,
Mathieu
Re: xen master: xl create hangs [ In reply to ]
Hi George

I did a git bisect between master (bad) and xen-RELEASE-1.46.1 (good) to try to
find the commit which introduced the bug.

Git bisect identified 74a11c43fd7e074b1f77631b446dd2115eacb9e8 merge commit as bad,
and deduced that the bug has been fixed between 74a11c43fd7e074b1f77631b446dd2115eacb9e8 and f26544492298cb82d66f9bf36e29d2f75b3133f2.

I don't know how to investigate further and locate the bad commit ?
If you have any ideas,

Mathieu
Re: xen master: xl create hangs [ In reply to ]
> On 20 Jul 2022, at 10:10, Mathieu Tarral <mathieu.tarral@protonmail.com> wrote:
>
> Hi George
>
> I did a git bisect between master (bad) and xen-RELEASE-1.46.1 (good) to try to
> find the commit which introduced the bug.
>
> Git bisect identified 74a11c43fd7e074b1f77631b446dd2115eacb9e8 merge commit as bad,
> and deduced that the bug has been fixed between 74a11c43fd7e074b1f77631b446dd2115eacb9e8 and f26544492298cb82d66f9bf36e29d2f75b3133f2.

So f2654… is actually just RELEASE-4.16.1. :-) I think the issue is that you’re trying to do a bisect across different branches. What you need to do is to find the point where RELEASE-4.16.1 and master diverged; you can do that with `git merge-base`:

$ git merge-base RELEASE-4.16.1 master

Verify that things work properly at *that* commit, then use that as the “good” starting point.

Unfortunately I’m not really doing Xen development anymore, so I can’t help dig into the logs or try to repro the issue. (One of the unusual thing you’re doing, it seems, is to boot with *only* a cd-rom drive, and no ’normal’ drives, or even any network cards.). I’ll try to get someone else to take a look at this thread.

-George
Re: xen master: xl create hangs [ In reply to ]
Hi George


> So f2654… is actually just RELEASE-4.16.1. :-) I think the issue is that you’re trying to do a bisect across different branches. What you need to do is to find the point where RELEASE-4.16.1 and master diverged; you can do that with `git merge-base`:
>
> $ git merge-base RELEASE-4.16.1 master

TIL: git merge-base

>
> Verify that things work properly at that commit, then use that as the “good” starting point.

Turns out that this commit (74a11c43fd7e074b1f77631b446dd2115eacb9e8) was also bad.
So I used git bisect again, but this time to find the commit which introduced the bug fix
between 74a11c43fd7e074b1f77631b446dd2115eacb9e8 and RELEASE-4.16.1.

After a few steps, git bisect identified this commit:
https://github.com/xen-project/xen/commit/59505f48fabed2e6fa5ad992edaabeb4a1441599
"Turn off debug by default"
Surprisingly simple.

And I confirm that it's one that fixes the issue of xl create hanging.

I cherry-picked this commit on master:
https://user-images.githubusercontent.com/964610/179986382-a774c91a-7b68-416b-9dbe-226b8aca0673.png

recompiled and tested again, my master branch now works as expected, tested with the small config file I already had and the XTF test-pv64-example.

So it works, but I don't know why this commit fixed it.

Best,
Mathieu
Re: xen master: xl create hangs [ In reply to ]
CCing Juergen and xen-devel.

On Mon, Jul 18, 2022 at 06:25:54PM +0000, Mathieu Tarral wrote:
> Using gdb to debug the xl process, I get the following stacktrace:
>
> (gdb) bt
> #0 __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=8652, futex_word=0x7f6debd22a50) at ./nptl/futex-internal.c:57
> #1 __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=8652, futex_word=0x7f6debd22a50) at ./nptl/futex-internal.c:87
> #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f6debd22a50, expected=8652, clockid=clockid@entry=0, abstime=abstime@entry=0x0,
> private=private@entry=128) at ./nptl/futex-internal.c:139
> #3 0x00007f6deba736a4 in __pthread_clockjoin_ex (threadid=140110084581248, thread_return=thread_return@entry=0x0, clockid=clockid@entry=0,
> abstime=abstime@entry=0x0, block=block@entry=true) at ./nptl/pthread_join_common.c:105
> #4 0x00007f6deba73543 in ___pthread_join (threadid=<optimized out>, thread_return=thread_return@entry=0x0) at ./nptl/pthread_join.c:24
> #5 0x00007f6deb9a144b in xs_daemon_close (h=0x561db3bc5bc0) at xs.c:366
> #6 0x00007f6deb9a145f in xs_close (xsh=<optimized out>) at xs.c:386
> #7 0x00007f6debc43a36 in libxl_ctx_free (ctx=0x561db3bc52e0) at libxl.c:173
> #8 0x0000561db33bf5a3 in xl_ctx_free () at xl.c:370
> #9 0x00007f6deba22495 in __run_exit_handlers (status=0, listp=0x7f6debbf6838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true,
> run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
> #10 0x00007f6deba22610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
> #11 0x00007f6deba06d97 in __libc_start_call_main (main=main@entry=0x561db33c0425 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffeb2f263d8)
> at ../sysdeps/nptl/libc_start_call_main.h:74
> #12 0x00007f6deba06e40 in __libc_start_main_impl (main=0x561db33c0425 <main>, argc=4, argv=0x7ffeb2f263d8, init=<optimized out>, fini=<optimized out>,
> rtld_fini=<optimized out>, stack_end=0x7ffeb2f263c8) at ../csu/libc-start.c:392
> #13 0x0000561db33bf425 in _start ()
>
> Colorized version in a Github Gist:
> https://gist.github.com/Wenzel/4da1e0a025954fac13a0ee57147cc44f
>
> So looks like xs_daemon_close is waiting on a thread to join:
> https://github.com/xen-project/xen/blob/a5fb66f4513c2c2d222dcc3753163b15690bd003/tools/libs/store/xs.c#L366

On Wed, Jul 20, 2022 at 12:53:29PM +0000, Mathieu Tarral wrote:
> > Verify that things work properly at that commit, then use that as the “good” starting point.
>
> Turns out that this commit (74a11c43fd7e074b1f77631b446dd2115eacb9e8) was also bad.
> So I used git bisect again, but this time to find the commit which introduced the bug fix
> between 74a11c43fd7e074b1f77631b446dd2115eacb9e8 and RELEASE-4.16.1.
>
> After a few steps, git bisect identified this commit:
> https://github.com/xen-project/xen/commit/59505f48fabed2e6fa5ad992edaabeb4a1441599
> "Turn off debug by default"
> Surprisingly simple.
>
> And I confirm that it's one that fixes the issue of xl create hanging.
>
> I cherry-picked this commit on master:
> https://user-images.githubusercontent.com/964610/179986382-a774c91a-7b68-416b-9dbe-226b8aca0673.png
>
> recompiled and tested again, my master branch now works as expected, tested with the small config file I already had and the XTF test-pv64-example.
>
> So it works, but I don't know why this commit fixed it.

$(debug) controls the level of optimisation of the compilation to make
it easier to debug.

So, with debug=y, we have libxenstore having issue with killing the
its reading thread? :-(
Maybe that reading thread is doing something that can't be stopped,
maybe it's waiting for a lock. Could you try to print a back trace of
that thread (or even all thread in `xl`)? ("thread apply all bt full" in gdb)

Thanks,

--
Anthony PERARD
Re: xen master: xl create hangs [ In reply to ]
Hi Anthony


> $(debug) controls the level of optimisation of the compilation to make
> it easier to debug.
>
> So, with debug=y, we have libxenstore having issue with killing the
> its reading thread? :-(
> Maybe that reading thread is doing something that can't be stopped,
> maybe it's waiting for a lock. Could you try to print a back trace of
> that thread (or even all thread in `xl`)? ("thread apply all bt full" in gdb)

I recompiled the buggy master, and this is the full GDB stacktrace when xl create hangs:
https://gist.github.com/Wenzel/969d5c06982246cd6cb2eb8cdf252a18

I don't see the same stacktrace as before, maybe I was on a different commit ?

I hope this helps.

Mathieu
Re: xen master: xl create hangs [ In reply to ]
On Wed, Jul 20, 2022 at 03:04:22PM +0000, Mathieu Tarral wrote:
> Hi Anthony
>
>
> > $(debug) controls the level of optimisation of the compilation to make
> > it easier to debug.
> >
> > So, with debug=y, we have libxenstore having issue with killing the
> > its reading thread? :-(
> > Maybe that reading thread is doing something that can't be stopped,
> > maybe it's waiting for a lock. Could you try to print a back trace of
> > that thread (or even all thread in `xl`)? ("thread apply all bt full" in gdb)
>
> I recompiled the buggy master, and this is the full GDB stacktrace when xl create hangs:
> https://gist.github.com/Wenzel/969d5c06982246cd6cb2eb8cdf252a18
>
> I don't see the same stacktrace as before, maybe I was on a different commit ?

I think that this `xl` process just wait for the domain to shutdown
or die. When we run `xl create`, before exiting there's a fork/exec of
xl which handle a few domain events, so it looks like this stack trace
is expected (and look like the one I have). So it don't looks like to be
the xl process that hangs.

--
Anthony PERARD
Re: xen master: xl create hangs [ In reply to ]
> I think that this `xl` process just wait for the domain to shutdown
> or die. When we run `xl create`, before exiting there's a fork/exec of
> xl which handle a few domain events, so it looks like this stack trace
> is expected (and look like the one I have). So it don't looks like to be
> the xl process that hangs.

I tested again but this time with XTF test-pv64-example:
https://user-images.githubusercontent.com/964610/180044164-74d12f63-d901-4e33-93be-073c7ed8d7dc.png

This is the new xl stacktrace:
https://gist.github.com/Wenzel/969d5c06982246cd6cb2eb8cdf252a18#file-gdb2-xs-daemon-close-c

It now shows the first thread waiting on the reading thread to join, as we expected:
https://github.com/xen-project/xen/blob/0e60f1d9d1970cae49ee9d03f5759f44afc1fdee/tools/libs/store/xs.c#L366

And the second one waiting in read_message:
https://github.com/xen-project/xen/blob/0e60f1d9d1970cae49ee9d03f5759f44afc1fdee/tools/libs/store/xs.c#L1265

Mathieu
Re: xen master: xl create hangs [ In reply to ]
On Wed, Jul 20, 2022 at 05:24:22PM +0100, Anthony PERARD wrote:
>
> I think that this `xl` process just wait for the domain to shutdown
> or die. When we run `xl create`, before exiting there's a fork/exec of
> xl which handle a few domain events, so it looks like this stack trace
> is expected (and look like the one I have). So it don't looks like to be
> the xl process that hangs.

I've got a patch to make use of `setproctitle()` to modify what shows as
the process name for this process. Unfortunately `setproctitle()` is
*BSD-only, "libbsd" for Linux implements similar functionality and I've
been meaning to figure out how it works.

I definitely think this should be done, just haven't gotten around to
finding a proper way to do it.


--
(\___(\___(\______ --=> 8-) EHM <=-- ______/)___/)___/)
\BS ( | ehem+sigmsg@m5p.com PGP 87145445 | ) /
\_CS\ | _____ -O #include <stddisclaimer.h> O- _____ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445