Hi,
I think I’ve tracked this down. I believe it’s a bug in XenServer’s event mechanism: some shared state causes parallel calls to event.from to interfere with each other. From CloudStack’s point of view this manifests as:
* spurious SESSION_INVALID exceptions in waitForTask, which triggers cleanup (Task.destroy), which prevents the VM.start from completing, leaving the VM paused
* empty lists of events being returned in non-timeout cases
I’ve prototyped a fix together with a test case (which fails before and passes after) and made a pull request containing both:
https://github.com/xapi-project/xen-api/pull/1719
I’d appreciate review from xapi experts, particularly Jon Ludlam (cc:d). I’ve also cc:d the main xapi development list.
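For anyone who needs a stop-gap on the CloudStack side in the meantime, the re-login workaround quoted below (catch SESSION_INVALID in the connection and log back in) amounts to roughly the following. This is only a minimal sketch, not the code from my commit: ReloginSketch, SessionInvalidException, Call and ReloggingConnection are all hypothetical stand-ins, and the session is modelled as a plain string rather than a real XenAPI session.

```java
import java.util.function.Supplier;

public class ReloginSketch {
    /** Hypothetical stand-in for the SESSION_INVALID failure (not the real bindings' type). */
    public static class SessionInvalidException extends RuntimeException {
        public SessionInvalidException(String message) { super(message); }
    }

    /** Hypothetical API call that needs a session reference. */
    public interface Call<T> {
        T run(String sessionRef);
    }

    /** Wraps a session and transparently logs back in when it turns out to be invalid. */
    public static class ReloggingConnection {
        private final Supplier<String> login; // assumption: returns a fresh session reference
        private String sessionRef;
        private int loginCount;

        public ReloggingConnection(Supplier<String> login) {
            this.login = login;
            this.sessionRef = login.get();
            this.loginCount = 1;
        }

        /** Runs the call; on an invalid session, logs in again and retries exactly once. */
        public <T> T dispatch(Call<T> call) {
            String ref = currentRef();
            try {
                // The call runs outside the lock so long-polling calls do not serialise.
                return call.run(ref);
            } catch (SessionInvalidException e) {
                return call.run(reloginIfStale(ref));
            }
        }

        public synchronized int loginCount() { return loginCount; }

        private synchronized String currentRef() { return sessionRef; }

        /** Logs in only if no other thread has already replaced the stale reference. */
        private synchronized String reloginIfStale(String stale) {
            if (stale.equals(sessionRef)) {
                sessionRef = login.get();
                loginCount++;
            }
            return sessionRef;
        }
    }

    public static void main(String[] args) {
        int[] counter = {0};
        ReloggingConnection conn = new ReloggingConnection(() -> "s" + (++counter[0]));
        // The first session ("s1") is treated as already invalidated by the server.
        String result = conn.dispatch(ref -> {
            if (ref.equals("s1")) throw new SessionInvalidException("stale session");
            return "ok:" + ref;
        });
        System.out.println(result + " after " + conn.loginCount() + " logins");
    }
}
```

The retry happens only once, so a second SESSION_INVALID still propagates to the caller. Note that this only hides the symptom; the underlying problem is the one the pull request above addresses.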
Cheers,
Dave
On 29 Apr 2014, at 05:15, Mike Tutkowski <mike.tutkowski@solidfire.com> wrote:
> Actually, the only issue I'm noticing now is the SSVM being automatically
> paused shortly after being created (while creating a new cloud).
>
> If I go to XenCenter and forcefully shut the VM down, CloudStack restarts
> it OK.
>
>
> On Mon, Apr 28, 2014 at 7:34 PM, Mike Tutkowski <
> mike.tutkowski@solidfire.com> wrote:
>
>> Figured I'd CC Anthony and Edison to see if they have any input on this
>> (it looks like most of the changes on the relevant file
>> (Xenserver625StorageProcessor.java) were performed by one or the other).
>>
>>
>> On Mon, Apr 28, 2014 at 12:40 PM, Mike Tutkowski <
>> mike.tutkowski@solidfire.com> wrote:
>>
>>> Thanks for the reply, guys.
>>>
>>> Just wanted to point out that this is on 4.4 for me (although the issue
>>> may also be present on master).
>>>
>>> I have a sufficient number of IP addresses for both system and user VMs,
>>> so that should be OK (but good thought, Punith).
>>>
>>> I plan to continue debugging this later this afternoon, but have been in
>>> meetings all morning.
>>>
>>> Thanks!
>>>
>>>
>>> On Mon, Apr 28, 2014 at 10:41 AM, Dave Scott <Dave.Scott@citrix.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> (sorry to reply to my own email!)
>>>>
>>>> On 28 Apr 2014, at 11:42, Dave Scott <Dave.Scott@citrix.com> wrote:
>>>>
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> On 28 Apr 2014, at 04:44, Mike Tutkowski <mike.tutkowski@solidfire.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I recently installed 6.2 with XS62ESP1 and XS62ESP1004 (so that
>>>>>> Xenserver625StorageProcessor would be utilized).
>>>>>>
>>>>>> When I create a cloud from scratch, my SSVM starts up fine, but CPVM ends
>>>>>> up in the Paused state. I have to force a shutdown of that VM and then
>>>>>> CloudStack restarts it and it works. This consistently happens. The system
>>>>>> VMs are being deployed to the local storage of the one XS host I have in my
>>>>>> one and only cluster.
>>>>>>
>>>>>> Any thoughts on that?
>>>>>
>>>>> I'm seeing the same symptom on my test cloud with 6.2 and XS62ESP1004.
>>>>> I think there's a problem with XenAPI session and task handling in the
>>>>> cloudstack master branch, although I've not tracked it down yet. In my
>>>>> management server log I see:
>>>>>
>>>>> WARN [c.c.h.x.r.CitrixResourceBase] (DirectAgent-5:ctx-47dccee1) Unable to start VM(v-2-VM) on host(1c4a31e9-469e-45c3-a0ad-9792ac7b20f6) due to You gave an invalid session reference. It may have been invalidated by a server restart, or timed out. You should get a new session handle, using one of the session.login_ calls. This error does not invalidate the current connection. The handle parameter echoes the bad value given.
>>>>> You gave an invalid session reference. It may have been invalidated by a server restart, or timed out. You should get a new session handle, using one of the session.login_ calls. This error does not invalidate the current connection. The handle parameter echoes the bad value given.
>>>>> at com.xensource.xenapi.Types.checkResponse(Types.java:218)
>>>>> at com.xensource.xenapi.Connection.dispatch(Connection.java:395)
>>>>> at com.cloud.hypervisor.xen.resource.XenServerConnectionPool$XenServerConnection.dispatch(XenServerConnectionPool.java:463)
>>>>> at com.xensource.xenapi.Event.from(Event.java:270)
>>>>> at org.apache.cloudstack.hypervisor.xenserver.XenServerResourceNewBase.waitForTask(XenServerResourceNewBase.java:113)
>>>>> at com.cloud.hypervisor.xen.resource.CitrixResourceBase.startVM(CitrixResourceBase.java:3455)
>>>>>
>>>>> Somehow the XenAPI session being used by the Event.from in
>>>>> XenServerResourceNewBase.waitForTask (used for recent 6.2 XenServers only)
>>>>> is being logged out. When this happens, the cloudstack cleanup
>>>>> code calls Task.cancel and Task.destroy, and then the XenServer
>>>>> Async.VM.start fails trying to update Task.progress before it internally
>>>>> calls VM.unpause.
>>>>>
>>>>> I made a hack to disable caching of Connection/sessions:
>>>>>
>>>>>
>>>>> https://github.com/djs55/cloudstack/commit/a388b71279086e42710e26340df0632d0d8135e4
>>>>
>>>> For reference / experimentation, I've made a slightly more plausible
>>>> patch:
>>>>
>>>>
>>>> https://github.com/djs55/cloudstack/commit/9d40f56c6384d04a5f0fb22e5b97530c0164e0b2
>>>>
>>>> It catches the SESSION_INVALID in the XenServerConnection and
>>>> transparently logs back in. This would prevent the higher level bits of the
>>>> XenServer plugin from having to deal with sessions being expired beneath
>>>> them.
>>>>
>>>> Cheers,
>>>> Dave
>>>>
>>>>>
>>>>> I suspect this now leaks Connections/sessions, but the symptom goes away.
>>>>>
>>>>> So far my thoughts are:
>>>>>
>>>>> 1. we need to find who's calling session.logout and why -- this will
>>>>> help fix the problem in the short term
>>>>>
>>>>> 2. The XenServer XenAPI bindings are harder to use than they should be
>>>>> (IMHO). In particular I think the bindings should take care of handling
>>>>> SESSION_INVALID exceptions and re-authenticating transparently, to avoid
>>>>> polluting the cloudstack code with rarely-used exception handlers.
>>>>>
>>>>> 3. the semantics of XenAPI task.destroy could be improved: instead of
>>>>> immediately removing the task (which then seems to cause cleanup code to
>>>>> fail randomly), it should be more like Unix waitpid with WNOHANG, i.e.
>>>>> set a bit which says, "I'm done with this. Destroy it when you are finished
>>>>> with it."
>>>>>
>>>>>
>>>>>>
>>>>>> Also, if I try to kick off a user VM to local storage, I get the
>>>>>> general-purpose InsufficientCapacityException and the virtual router does
>>>>>> not even start up.
>>>>>
>>>>> No idea about this one :)
>>>>>
>>>>> Cheers,
>>>>> Dave
>>>>>
>>>>>>
>>>>>> Can anyone create a similar cloud to what I've described here with XS 6.2,
>>>>>> XS62ESP1, and XS62ESP1004? I re-ran this test using an XS 6.1 host and it
>>>>>> works just fine.
>>>>>>
>>>>>> At the moment, this is blocking a test case I'm trying to execute to verify
>>>>>> code I had to write in Xenserver625StorageProcessor.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> *Mike Tutkowski*
>>>>>> *Senior CloudStack Developer, SolidFire Inc.*
>>>>>> e: mike.tutkowski@solidfire.com
>>>>>> o: 303.746.7302
>>>>>> Advancing the way the world uses the
>>>>>> cloud<http://solidfire.com/solution/overview/?video=play>
>>>>>> *(tm)*
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
_______________________________________________
Xen-api mailing list
Xen-api@lists.xen.org
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api