xen dom0 nfs hangs

May 31, 2015, 12:09 PM

Post #1 of 3 (822 views)

Hi,

I have xen running, most recently under ubuntu 15, on a host which runs
a small number of domUs doing small non-intensive jobs (dns serving,
spam filtering, radius). The dom0 nfs mounts a directory holding the
domU disk image files on my netapp filer, and the domU config files all
use loopback mounts for these disk images. Occasionally, for some
reason I have yet to fathom, NFS simply stops working from the dom0 and
all processes accessing nfs simply hang. I get messages about 'task
blocked more than 120 seconds' (from qemu-system-i386) and so forth; the
dom0 is otherwise responsive, is not swapping, shows no high load and
logs no other kernel messages; it's simply that NFS has gone away. Other
dom0 hosts nfs mounting domU disk image files from this same filer have
no problems at all. The domUs on this affected xen host hang: networking
is still working, they are ping reachable and anything not depending on
disk access from inside the domU itself continues to work, but any
process that touches disk (sendmail for example) is hung.

I have taken the following troubleshooting steps:

The host originally was an AMD box, running Ubuntu 14. I tried all
of the memory tuning advice, minimum dom0 memory, cpu pinning, etc. NFS
continued to have hangs.

I upgraded the box to an intel hexacore platform with 64g of ram.
Same problems.

I installed a dedicated 4port gigE nic and put the NFS traffic onto
its own bonded port channel. Same problems.

I upgraded to ubuntu 15. Same problems.

I tuned even more kernel variables such as swappiness, dirty cache
and so forth, down to almost nothing. Same problems.

I have SPAN capturing all network traffic to and from the box
during the problem period. I can see nothing obviously going wrong, but
I don't have good tools beyond tcpdump to really dig into the traffic.

I have arpwatch running to make sure we don't have an ip conflict
on the nfs network. Nothing noted.

I have the switches doing extended debugging for all interface
state transitions, stp transitions, nothing noted, no errors, everything
is clean and good.

I have had the experience where, during a period of NFS hang
lasting more than 2 hours, it suddenly comes right back and picks up
where it left off, all vms suddenly come back to life and things are
all good again.
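
A minimal sketch, not part of the original post, of one more check that fits the steps above: comparing the NFS client's RPC call and retransmit counters before and during a hang. It assumes a Linux dom0 where the NFS client exposes /proc/net/rpc/nfs with an "rpc" line whose first two numbers are calls and retransmits (the same figures nfsstat -rc reports). The function names are made up for illustration.

```python
from pathlib import Path


def parse_rpc_counters(text):
    """Return (calls, retransmits) from the 'rpc' line of /proc/net/rpc/nfs."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "rpc":
            return int(fields[1]), int(fields[2])
    raise ValueError("no 'rpc' line found")


def read_rpc_counters(path="/proc/net/rpc/nfs"):
    """Read the counters from the live system (path is the usual location)."""
    return parse_rpc_counters(Path(path).read_text())


def counter_delta(before, after):
    """Difference between two (calls, retransmits) samples."""
    return after[0] - before[0], after[1] - before[1]
```

Calling read_rpc_counters() twice a few seconds apart and passing both samples to counter_delta() shows whether retransmits climb while calls stay flat, which would suggest requests are leaving the box but not being answered.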

The short fix for when this occurs is to simply reboot the box.
Then everything just comes back and all is well. But the problem
continues unabated and I have been fighting this for too long. The best
I can guess is that it's "something" with nfs, but that is all. If I
can't find a solution soon, I would be willing to consider other storage
methods including iSCSI. The issue with that, however, is that nfs makes
sense to me, I can deal with it, I know how to back it up, how to manage
the space and the mounts and such, and iSCSI is an enigma to me. I
haven't found any really good howtos or other documents showing how
to actually connect all the pieces, unless someone has a pointer they can
shoot my way. I'd love to understand what actually appears to be killing
nfs and to fix that problem instead, but at this point just getting away
from this problem and restoring stability here is more important.

Thank you.

Mike-

_______________________________________________
Xen-users mailing list
Xen-users@lists.xen.org
http://lists.xen.org/xen-users

Re: xen dom0 nfs hangs [ In reply to ]

ian.campbell at citrix

Jun 1, 2015, 6:24 AM

Post #2 of 3 (808 views)

On Sun, 2015-05-31 at 12:09 -0700, Mike wrote:

I've no idea what might be going on here, but:

> I get messages about 'task blocked more than 120 seconds' (from
> qemu-system-i386) and so forth;

The stack traces from those messages may prove informative, since they
will indicate where (and hopefully therefore why) that process has been
blocked for so long.
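
A rough sketch (an illustration, not something posted in the thread) of how those stacks could also be collected on demand while a hang is in progress: scan /proc for tasks in uninterruptible sleep (state D) and read each one's kernel stack. It assumes a Linux dom0, root privileges, and a kernel that exposes /proc/<pid>/stack; where that file is missing or unreadable the helper just says so. The function names are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so locate the first '(' and the last ')'.
    """
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc="/proc"):
    """Return [(pid, comm)] for every process currently in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


def kernel_stack(pid, proc="/proc"):
    """Return the kernel stack of a task, or a note if it cannot be read."""
    try:
        with open(os.path.join(proc, str(pid), "stack")) as fh:
            return fh.read()
    except OSError as exc:
        return "<stack unavailable: %s>" % exc
```

Looping over blocked_tasks() and printing kernel_stack(pid) for each entry gives roughly the same information as the hung-task messages in the kernel log, without waiting for the 120 second threshold.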

I don't think storing guest filesystem images on an NFS share as you are
doing is in any way uncommon.

You say you are using loopback mounts, I suppose you mean
literally /dev/loop0 etc (either explicitly via losetup or implicitly
via the toolstack)?

In the scenarios I'm aware of people tend to use either tapdisk or qdisk
(from qemu) to expose files on NFS as guest disks. Mostly they are using
vhd or qcow2 (so /dev/loop is not an option), but I wonder if switching
to e.g. qdisk would help? (Switching to tapdisk would involve several
yak shaving exercises I suspect, not worth it IMHO)

On the other hand you mention qemu-system-i386 so perhaps you are
already using qdisk? Or are these guests HVM ones?
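
For illustration only (this was not part of the original reply): if the guests are managed with the xl toolstack, which the thread does not confirm, the backend can be requested explicitly in the disk specification with backendtype=qdisk. The small helper below just builds such a specification string; the image path and device name are hypothetical placeholders.

```python
def qdisk_spec(image_path, vdev="xvda", fmt="raw", access="rw"):
    """Build an xl disk specification that asks for the qdisk backend.

    target= goes last because, when given by name, it consumes the rest of
    the specification.
    """
    return "format=%s, vdev=%s, access=%s, backendtype=qdisk, target=%s" % (
        fmt, vdev, access, image_path)


# Hypothetical image location on the NFS mount; adjust to the real path.
disk = [qdisk_spec("/path/to/nfs/mount/guest1.img")]
```

The resulting list has the same shape as the disk = [ ... ] line in a guest config file, so the printed string can be pasted there to try qdisk on a single guest first.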

Ian.

Re: xen dom0 nfs hangs [ In reply to ]

krichy at tvnetwork

Jul 16, 2015, 2:52 AM

Post #3 of 3 (779 views)

Dear Xen users,

I also had an issue with, strangely, the same symptoms as Mike's.

We run ganeti; all nodes have a /srv/ganeti/shared-file-storage mountpoint
from an nfs server, and with debian jessie's xen (4.4), if a node has
high overall iops from all of its guests, it will sometimes stall. The
same problem as Mike's: qemu processes remain in D state, and the nfs
path can be read but not written to.
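
A minimal sketch (added for illustration, not from the original post) of a probe for the "readable but not writable" state described above. It assumes that a write to a stalled mount blocks indefinitely, so the write is done in a child process that the parent can give up on after a timeout. The function name and probe file name are made up.

```python
import os
import subprocess
import sys

# Code run in the child: create a file, write to it, force it out, remove it.
PROBE = (
    "import os, sys\n"
    "fd = os.open(sys.argv[1], os.O_WRONLY | os.O_CREAT, 0o600)\n"
    "os.write(fd, b'probe')\n"
    "os.fsync(fd)\n"
    "os.close(fd)\n"
    "os.unlink(sys.argv[1])\n"
)


def write_probe(directory, timeout=10.0):
    """Return 'ok', 'error' or 'hung' for a test write into directory."""
    path = os.path.join(directory, ".write_probe.%d" % os.getpid())
    child = subprocess.Popen([sys.executable, "-c", PROBE, path],
                             stderr=subprocess.DEVNULL)
    try:
        return "ok" if child.wait(timeout=timeout) == 0 else "error"
    except subprocess.TimeoutExpired:
        # A child stuck in D state may ignore the kill until the mount
        # recovers, so do not wait for it here.
        child.kill()
        return "hung"
```

Run periodically against the shared storage directory, this records when writes stop completing, which can then be lined up with the iops load on the node.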

Mike, were you able to solve your issue somehow?

Earlier we were using debian wheezy with kernel 3.2 and xen 4.1. I don't
know if the linux nfs client implementation has some bug or if the
handling of file-backed vbds causes the issue.

Unfortunately I cannot reproduce the problem, it just arises randomly. I
assume it is related to high iops.

Thanks in advance,

Kojedzinszky Richard
Euronet Magyarorszag Informatika Zrt.
