Mailing List Archive

XENBUS: Timeout connecting to device errors
We've been noticing a lot of these errors when booting VMs since we
moved to 3.0.3. I've traced this to the hotplug scripts in Dom0 taking
more than 10s to run to completion, and specifically to the vif-bridge
script occasionally taking 9s or more to plug the vif into the software
bridge. I was wondering if anyone has any insight into why it might
take this long.

I added some instrumentation to the scripts to log entry/exit from
xen-backend.agent and also lock contention (the resulting log is
attached at the end of this message, and a rough sketch of the
instrumentation follows the list below). I have the following
observations:

1. Currently, the various script invocations are issued in parallel but
   are serialized by a single global lock -- is it really necessary, for
   example, to serialize vif and vbd hotplug processing in Dom0?

2. In most cases we've seen, this problem happens when the first VM is
   started after re-installing a box. In the example below, the 'vif
   online' processing started at 2:21:53 and did not finish until
   2:22:04.

3. Clearly a hard-coded timeout of 10s is less than perfect -- is there
   no better way of knowing when the hotplug processing is done?

Thanks,
Simon

dom0 /var/log/messages:

Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent:
xen-backend[20234]: Start vif: add
Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent:
xen-backend[20234]: End vif: add
Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent:
xen-backend[20240]: Start vif: online
Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent:
xen-backend[20252]: Start vbd: add
Dec 4 02:21:53 gromit xen-hotplug: /etc/hotplug/xen-backend.agent: Lock
/var/run/xen-hotplug/xenbus_hotplug_global by 20252 - currently owned by
20240: /etc/hotplug/xen-backend.agent
Dec 4 02:21:54 gromit lvm[12123]: XenDom wallace1: state changed
stopped => paused
Dec 4 02:21:54 gromit sn2spine: start RESULT <?xml version="1.0" ?>
<result status='ok' code='200'> <guest
id="3f879d14-8c70-48af-ae02-88df3afad3cb"><name>wallace1</name>
<id>3f879d14-8c70-48af-ae02-88df3afad3cb</id>
<system>gromit.sn.stratus.com</system><state>starting</state>
<availability>failover</availability><mode>duplex</mode>
<memory>256</memory><cpus>1</cpus><storage><volume device="hda1"
mountpoint="/" name="drbd0"/></storage></guest> </result>

Dec 4 02:21:54 gromit xen-hotplug: /etc/xen/scripts/vif-bridge: online
XENBUS_PATH=backend/vif/1/0
Dec 4 02:21:54 gromit kernel: device vif1.0 entered promiscuous mode
Dec 4 02:21:54 gromit xen-hotplug: /etc/xen/scripts/vif-bridge:
iptables -A FORWARD -m physdev --physdev-in vif1.0 -j ACCEPT failed. If
you are using iptables, this may affect networking for guest domains.
Dec 4 02:21:55 gromit kernel: xenbr0: port 3(vif1.0) entering learning
state
Dec 4 02:21:59 gromit kernel: xenbr0: topology change detected,
propagating
Dec 4 02:22:03 gromit kernel: xenbr0: port 3(vif1.0) entering
forwarding state
Dec 4 02:22:04 gromit lvm[12123]: XenDom wallace1: state changed paused
=> running
Dec 4 02:22:04 gromit xen-hotplug: /etc/hotplug/xen-backend.agent:
xen-backend[20240]: End vif: online
Dec 4 02:22:18 gromit kernel: ip_tables: (C) 2000-2006 Netfilter Core
Team
Dec 4 02:22:22 gromit xen-hotplug: /etc/xen/scripts/block: add
XENBUS_PATH=backend/vbd/1/769
Dec 4 02:22:26 gromit xen-hotplug: /etc/hotplug/xen-backend.agent:
xen-backend[20252]: End vbd: add
Dec 4 02:22:29 gromit lvm[12123]: XenDom wallace1: state changed
running => crashed


guest console:
Dec 4 02:21:54 Linux version 2.6.16.29-xenU
(sntriage@anna.sn.stratus.com) (gcc version 3.4.4 20050721 (Red Hat
3.4.4-2)) #1 SMP Mon Dec 4 01:33:25 EST 2006
...
Dec 4 02:22:04 XENBUS: Timeout connecting to device: device/vbd/769
(state 3)
Dec 4 02:22:04 Root-NFS: No NFS server available, giving up.
Dec 4 02:22:04 VFS: Unable to mount root fs via NFS, trying floppy.
Dec 4 02:22:04 VFS: Cannot open root device "hda1" or
unknown-block(2,0)
Dec 4 02:22:04 Please append a correct "root=" boot option
Dec 4 02:22:04 Kernel panic - not syncing: VFS: Unable to mount root fs
on unknown-block(2,0)

Re: XENBUS: Timeout connecting to device errors
On Mon, Dec 04, 2006 at 02:18:37PM -0500, Graham, Simon wrote:

> We've been noticing a lot of these errors when booting VMs since we
> moved to 3.0.3. I've traced this to the hotplug scripts in Dom0 taking
> more than 10s to run to completion, and specifically to the vif-bridge
> script occasionally taking 9s or more to plug the vif into the
> software bridge. I was wondering if anyone has any insight into why it
> might take this long.
>
> I added some instrumentation to the scripts to log entry/exit from
> xen-backend.agent and also lock contention (the resulting log is
> attached at the end of this message). I have the following
> observations:
>
> 1. Currently, the various script invocations are issued in parallel
>    but are serialized by a single global lock -- is it really
>    necessary, for example, to serialize vif and vbd hotplug processing
>    in Dom0?

You need to serialise VBD hotplug if you are going to get the right result
when performing the sharing check. If you're using vif-nat, you need to
serialise the modifications to the DHCP configuration file. Other than that,
I don't think that there's a need to serialise events at startup. On Bugzilla
#515, Harry Butterworth notes that there is a race condition in teardown,
which is why he introduced the global lock. You could make this cleverer,
possibly, so that it doesn't affect startup times.
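
As a sketch only (untested, and "$class" stands for however
xen-backend.agent derives the device class from the event), you could
split the lock by class:

    # one lock per device class instead of one global lock, so vif and
    # vbd bringup no longer serialise each other; vbd stays serialised
    # within its class for the sharing check, vif for vif-nat's DHCP
    # configuration edits
    claim_lock "xenbus_hotplug_${class}"
    # ... per-class hotplug processing ...
    release_lock "xenbus_hotplug_${class}"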

All that said, I believe that udev is supposed to serialise all events
anyway, so unless you're using hotplug rather than udev, I'd expect you to see
no lock contention whatsoever.
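
For what it's worth, on a 2.6-era kernel you can check which mechanism
is wired up from the kernel's hotplug helper setting (a generic check,
not Xen-specific):

    cat /proc/sys/kernel/hotplug    # /sbin/hotplug => hotplug agents,
                                    # /sbin/udevsend or empty => udev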

> 2. In most cases we've seen, this problem happens when the first VM is
>    started after re-installing a box. In the example below, the 'vif
>    online' processing started at 2:21:53 and did not finish until
>    2:22:04.
>
> 3. Clearly a hard-coded timeout of 10s is less than perfect -- is
>    there no better way of knowing when the hotplug processing is done?

We know precisely when hotplugging is done -- the scripts write an entry into
the store to tell us so. It's knowing when they've locked up that's the hard
bit.
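
Roughly, each script's success path ends with something like this (a
sketch of the mechanism rather than the exact helper; XENBUS_PATH is
the same variable you can see in the log above):

    # signal completion to xend by writing to the device's store node
    xenstore-write "$XENBUS_PATH/hotplug-status" connected

Xend then waits on that hotplug-status node rather than guessing when
the scripts have finished.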

If you're seeing vif bringup taking 9 seconds, then clearly the 10 second
timeout is far too short. There's no particular reason to keep the timeout
short, so feel free to lengthen it, with the obvious consequences. Bear in
mind that Xend will time out the whole device bringup phase after 100 seconds.
I'd want to root-cause the 9 second bringup as well, as I don't see why it
ought to take that long.

Cheers,

Ewan.

RE: XENBUS: Timeout connecting to device errors
> All that said, I believe that udev is supposed to serialise all events
> anyway, so unless you're using hotplug rather than udev, I'd expect
> you to see no lock contention whatsoever.

We're using hotplug (and a RHEL4-based Dom0) -- it seems to issue all
the hotplug script invocations at the same time.

> > 3. Clearly a hard-coded timeout of 10s is less than perfect -- is
> > there no better way of knowing when the hotplug processing is done?
>
> We know precisely when hotplugging is done -- the scripts write an
> entry into the store to tell us so. It's knowing when they've locked
> up that's the hard bit.

Of course - silly me!

> If you're seeing vif bringup taking 9 seconds, then clearly the 10
> second timeout is far too short. There's no particular reason to keep
> the timeout short, so feel free to lengthen it, with the obvious
> consequences. Bear in mind that Xend will time out the whole device
> bringup phase after 100 seconds. I'd want to root-cause the 9 second
> bringup as well, as I don't see why it ought to take that long.

So, just for grins, I increased the timeout to 20s since that seemed to
be plenty of time based on the log files I've seen -- the timeout still
happens and now the hotplug scripts seem to take ~20s to run... I'm
beginning to think that there is some resource contention between the
domain that is starting up and dom0...
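
Next I'll probably try tracing vif-bridge with timestamps to see
exactly where the time goes -- something like this near the top of the
script (a sketch; the log path is just an example):

    PS4='+ $(date +%s.%N) ${0##*/}:$LINENO: '   # timestamp each traced command
    set -x                                      # xtrace output goes to stderr
    exec 2>>/var/log/xen-hotplug-trace.log      # keep the trace somewhere persistent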

All suggestions welcome!
Simon

