Mailing List Archive

Snapshot causes disk errors and VM crashes
Hi,

I have a problem that's occurred a couple of times with both my test and
production systems. I'd like to know if I'm doing something wrong or it
could be a potential bug.

I have a Debian 7 wheezy dom0 running a Debian 7 wheezy domU. I plan on
deploying multiple instances but right now I'm building one to replace an
existing production server.

I installed the xs-tools on the VM from the XenServer 6.2 distribution, and
most everything works just fine, however I have stumbled on an issue that's
struck me 3 times on 2 separate servers. When I take a live snapshot,
*sometimes* (not always) the guest OS pukes and starts throwing disk
errors. It effectively nails the guest OS and causes it to die with miles
of IO errors all over the console. It's not pretty :\

Any ideas? I'm pretty sure in the latest instance the guest OS was fully
loaded and sitting at a login prompt at the time, so it wasn't the lack of
the xs-tool daemon as far as I can see. Maybe something a process was doing
in the background on the guest interfered?

This isn't a critical issue for me right now, I just need to remember to
snapshot when the guest VM is offline, but if I need to snapshot once this
is in production it could cause an unintended outage and downtime (which
would make me slightly unpopular!).

--

Mark Benson
Re: Snapshot causes disk errors and VM crashes
> I installed the xs-tools on the VM from the XenServer 6.2 distribution, and
> most everything works just fine, however I have stumbled on an issue that's
> struck me 3 times on 2 separate servers. When I take a live snapshot,
> *sometimes* (not always) the guest OS pukes and starts throwing disk
> errors. It effectively nails the guest OS and causes it to die with miles of IO
> errors all over the console. It's not pretty :\
>
> Any ideas? I'm pretty sure in the latest instance the guest OS was fully loaded
> and sitting at a login prompt at the time, so it wasn't the lack of the xs-tool
> daemon as far as I can see. Maybe something a process was doing in the
> background on the guest interfered?

Looks like something goes bust in the VM datapath. Can you check /var/log/SMlog for exceptions, and /var/log/{daemon.log,kern.log,messages,syslog} for any tapdisk errors? (grep for "tap-err" and "segfault")
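
For anyone following along, the grep suggested above can be approximated with a short Python sketch. It looks for the two strings named in the message ("tap-err" and "segfault") in the four log files listed there. The helper names find_matches and scan are made up for this illustration, and any of the files may be missing or unreadable on a given host, so the sketch simply skips those.

```python
import re
from pathlib import Path

# The two strings named in the message above, matched as written.
PATTERN = re.compile(r"tap-err|segfault")

# Log files named in the message above; some may not exist on a given host.
LOG_FILES = [
    "/var/log/daemon.log",
    "/var/log/kern.log",
    "/var/log/messages",
    "/var/log/syslog",
]


def find_matches(lines):
    """Return (line_number, text) pairs for lines containing either string."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


def scan(paths=LOG_FILES):
    """Print matching lines from every listed log that exists and is readable."""
    for name in paths:
        path = Path(name)
        if not path.is_file():
            continue
        try:
            with path.open(errors="replace") as handle:
                for number, text in find_matches(handle):
                    print(f"{name}:{number}: {text}")
        except OSError:
            # Unreadable file (for example, insufficient permissions): skip it.
            continue


if __name__ == "__main__":
    scan()
```

Run it as root on the dom0 so the logs are readable. /var/log/SMlog is left out on purpose, since the message asks for exceptions there rather than these two strings.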
_______________________________________________
Xen-api mailing list
Xen-api@lists.xen.org
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api
Re: Snapshot causes disk errors and VM crashes
> I couldn't find any tap-segfault messages for the time of the incident, I

Just to clarify, it's either "tap-err" or "segfault", but not "tap-segfault". Also, check /var/log/user.log.

> pastebin'd the SMlog covering that time slot when the snapshot was taken
> (it can be seen in the log) but I couldn't find much in the way of exceptions.
>
> http://pastebin.com/fgcZ0T2W

I can't find anything wrong in this log excerpt. Can you post the other logs in /var/log (kern.log, daemon.log, messages, user.log, syslog)?

> If it makes any difference, the SR is mounted via NFS from the local machine.
> Am I doing the wrong thing? I made it NFS accessible to make it available in
> the event of pooling servers, then I can simply attach to the other server as
> well. Should I be mounting local storage locally? Does it make a difference?

That's a bit non-standard but I don't see anything wrong with it.
Re: Snapshot causes disk errors and VM crashes
(Please don't drop xen-api from the CC)

> I dropped them all here, the large ones are trimmed to the relevant day only:
>
> https://www.dropbox.com/sh/4v1l141dw7fao3c/AAC8YrMONznv6Wdl0Y0YyueCa?dl=0
>
> I think the relevant time frame is about 20-11-2014 at 09:20-10:30 - I think the
> snapshot was around 09:35
>
> That's a bit non-standard but I don't see anything wrong with it.

I see lots of errors like the following:

Nov 20 09:39:21 kalimantan tapdisk[1899]: ERROR: errno -14 at vhd_complete: /var/run/sr-mount/34ff5733-1e1d-dc84-137e-95c849222ca4/2f6a71be-c1e7-4463-a77c-0d0e627745a3.vhd: op: 5, lsec: 33456128, secs: 8, nbytes: 4096, blk: 8168, blk_offset: 4294967295

These almost certainly led to the VM experiencing I/O errors. errno -14 is EFAULT (bad address), which is returned to tapdisk by some fairly low-level function, possibly a system call; unfortunately, the log gives no more information than that.

Can you check your logs for anything of interest around that time?
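
As a quick cross-check of the errno reading above, here is a minimal Python sketch that parses the quoted tapdisk line and looks up the error number in the standard errno table. The parse helper is hypothetical, written only for this illustration. The 512-byte sector size used in the consistency check is an assumption (it happens to make secs and nbytes agree), and the sketch does not try to interpret blk_offset beyond showing it in hex.

```python
import errno
import os
import re

# The log line quoted above, copied verbatim.
LINE = (
    "Nov 20 09:39:21 kalimantan tapdisk[1899]: ERROR: errno -14 at vhd_complete: "
    "/var/run/sr-mount/34ff5733-1e1d-dc84-137e-95c849222ca4/"
    "2f6a71be-c1e7-4463-a77c-0d0e627745a3.vhd: op: 5, lsec: 33456128, secs: 8, "
    "nbytes: 4096, blk: 8168, blk_offset: 4294967295"
)

# "name: integer" pairs such as "secs: 8" or "blk_offset: 4294967295".
FIELD = re.compile(r"(\w+): (-?\d+)")


def parse(line):
    """Return (errno_value, fields) extracted from a tapdisk error line."""
    err = int(re.search(r"errno (-?\d+)", line).group(1))
    fields = {key: int(value) for key, value in FIELD.findall(line)}
    return err, fields


if __name__ == "__main__":
    err, fields = parse(LINE)
    # tapdisk logs the errno negated, so flip the sign before the lookup.
    print(errno.errorcode[-err], "-", os.strerror(-err))
    # Assumed 512-byte sectors: 8 sectors would be 4096 bytes, matching nbytes.
    print("secs * 512 == nbytes:", fields["secs"] * 512 == fields["nbytes"])
    # The reported block offset is the all-ones 32-bit value.
    print("blk_offset:", hex(fields["blk_offset"]))
```

The one field that stands out is blk_offset: 4294967295 is 0xffffffff. Whether tapdisk uses that value as a marker for a block that is not yet allocated in the VHD is not confirmed anywhere in this thread, so treat that reading as a guess.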
Re: Snapshot causes disk errors and VM crashes
(Sorry, I will try not to. I always set mailing lists up to reply-to the
list, not the poster, perhaps someone can suggest this to the admins!)

Where should I look now? I am relatively new to Xen but am going to be
admin for this system eventually so need to know these things :)

--

Mark Benson

On Fri, Nov 21, 2014 at 12:47 PM, Thanos Makatos <thanos.makatos@citrix.com>
wrote:

> (Please don't drop xen-api from the CC)
Re: Snapshot causes disk errors and VM crashes
> Where should I look now? I am relatively new to Xen but am going to be
> admin for this system eventually so need to know these things :)

Just check all system logs around that time for anything of potential interest. Also, you could try strace'ing tapdisk while taking a snapshot to see which function call returns EFAULT. Instrumenting this in the SM code (/opt/xensource/sm/blktap.py) would be the best solution, but you can always do this manually.
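
One way to do the manual route: attach strace to the running tapdisk process with something like "strace -f -p <tapdisk pid> -o tapdisk.strace", take the snapshot, then filter the saved trace for calls that failed with EFAULT. The Python sketch below does only the filtering step. The efault_calls helper and the file name tapdisk.strace are made up for this illustration. It assumes strace's usual "name(args) = -1 EFAULT (Bad address)" line format, with an optional PID prefix.

```python
import re
import sys

# A failed call in strace output looks like:
#   name(args...) = -1 EFAULT (Bad address)
# optionally prefixed by "[pid N] " or by a bare PID when -f and -o are used.
EFAULT_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*=\s+-1\s+EFAULT\b"
)


def efault_calls(lines):
    """Return the names of calls that failed with EFAULT, in trace order."""
    names = []
    for line in lines:
        match = EFAULT_CALL.match(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python efault_filter.py tapdisk.strace
    with open(sys.argv[1], errors="replace") as trace:
        for name in efault_calls(trace):
            print(name)
```

If the filter prints nothing, that does not rule the error out: should tapdisk receive the failure through an asynchronous completion rather than as a direct return value, it would not show up in this form, and the raw trace around the snapshot time would need reading by hand.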

