Mailing List Archive

Re: [ACS4.4, XenServer] Problem starting system VMs
Hi,

I think I’ve tracked this down. I believe it’s a bug in XenServer’s event mechanism: some shared state causes parallel calls to event.from to interfere with each other. From CloudStack’s point of view this manifests as:

* spurious SESSION_INVALID exceptions in waitForTask, which triggers cleanup (Task.destroy), which prevents the VM.start from completing, leaving the VM paused
* empty lists of events being returned in non-timeout cases

I’ve prototyped a fix together with a test case (which fails before and passes after) and made a pull request containing both:

https://github.com/xapi-project/xen-api/pull/1719

I’d appreciate review from xapi experts, particularly Jon Ludlam (cc:d). I’ve also cc:d the main xapi development list.

Cheers,
Dave

On 29 Apr 2014, at 05:15, Mike Tutkowski <mike.tutkowski@solidfire.com> wrote:

> Actually, the only issue I'm noticing now is the SSVM being automatically
> paused shortly after being created (while creating a new cloud).
>
> If I go to XenCenter and forcefully shut the VM down, CloudStack restarts
> it OK.
>
>
> On Mon, Apr 28, 2014 at 7:34 PM, Mike Tutkowski <
> mike.tutkowski@solidfire.com> wrote:
>
>> Figured I'd CC Anthony and Edison to see if they have any input on this
>> (it looks like most of the changes on the relevant file
>> (Xenserver625StorageProcessor.java) were performed by one or the other).
>>
>>
>> On Mon, Apr 28, 2014 at 12:40 PM, Mike Tutkowski <
>> mike.tutkowski@solidfire.com> wrote:
>>
>>> Thanks for the reply, guys.
>>>
>>> Just wanted to point out that this is on 4.4 for me (although the issue
>>> may also be present on master).
>>>
>>> I have a sufficient number of IP addresses for both system and user VMs,
>>> so that should be OK (but good thought, Punith).
>>>
>>> I plan to continue debugging this later this afternoon, but have been in
>>> meetings all morning.
>>>
>>> Thanks!
>>>
>>>
>>> On Mon, Apr 28, 2014 at 10:41 AM, Dave Scott <Dave.Scott@citrix.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> (sorry to reply to my own email!)
>>>>
>>>> On 28 Apr 2014, at 11:42, Dave Scott <Dave.Scott@citrix.com> wrote:
>>>>
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> On 28 Apr 2014, at 04:44, Mike Tutkowski <mike.tutkowski@solidfire.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I recently installed 6.2 with XS62ESP1 and XS62ESP1004 (so that
>>>>>> Xenserver625StorageProcessor would be utilized).
>>>>>>
>>>>>> When I create a cloud from scratch, my SSVM starts up fine, but CPVM ends
>>>>>> up in the Paused state. I have to force a shutdown of that VM and then
>>>>>> CloudStack restarts it and it works. This consistently happens. The system
>>>>>> VMs are being deployed to the local storage of the one XS host I have in my
>>>>>> one and only cluster.
>>>>>>
>>>>>> Any thoughts on that?
>>>>>
>>>>> I'm seeing the same symptom on my test cloud with 6.2 and XS62ESP1004.
>>>>> I think there's a problem with XenAPI session and task handling in the
>>>>> cloudstack master branch, although I've not tracked it down yet. In my
>>>>> management server log I see:
>>>>>
>>>>> WARN [c.c.h.x.r.CitrixResourceBase] (DirectAgent-5:ctx-47dccee1) Unable to start VM(v-2-VM) on host(1c4a31e9-469e-45c3-a0ad-9792ac7b20f6) due to You gave an invalid session reference. It may have been invalidated by a server restart, or timed out. You should get a new session handle, using one of the session.login_ calls. This error does not invalidate the current connection. The handle parameter echoes the bad value given.
>>>>> You gave an invalid session reference. It may have been invalidated by a server restart, or timed out. You should get a new session handle, using one of the session.login_ calls. This error does not invalidate the current connection. The handle parameter echoes the bad value given.
>>>>> at com.xensource.xenapi.Types.checkResponse(Types.java:218)
>>>>> at com.xensource.xenapi.Connection.dispatch(Connection.java:395)
>>>>> at com.cloud.hypervisor.xen.resource.XenServerConnectionPool$XenServerConnection.dispatch(XenServerConnectionPool.java:463)
>>>>> at com.xensource.xenapi.Event.from(Event.java:270)
>>>>> at org.apache.cloudstack.hypervisor.xenserver.XenServerResourceNewBase.waitForTask(XenServerResourceNewBase.java:113)
>>>>> at com.cloud.hypervisor.xen.resource.CitrixResourceBase.startVM(CitrixResourceBase.java:3455)
>>>>>
>>>>> Somehow the XenAPI session being used by the Event.from in the
>>>>> XenServerResourceNewBase.waitForTask (used for recent 6.2 XenServers only)
>>>>> is being logged-out somewhere. When this happens, the cloudstack cleanup
>>>>> code calls Task.cancel and Task.destroy, and then the XenServer
>>>>> Async.VM.start fails trying to update Task.progress before it internally
>>>>> calls VM.unpause.
>>>>>
>>>>> I made a hack to disable caching of Connection/sessions:
>>>>>
>>>>> https://github.com/djs55/cloudstack/commit/a388b71279086e42710e26340df0632d0d8135e4
>>>>
>>>> For reference / experimentation, I've made a slightly more plausible
>>>> patch:
>>>>
>>>>
>>>> https://github.com/djs55/cloudstack/commit/9d40f56c6384d04a5f0fb22e5b97530c0164e0b2
>>>>
>>>> It catches the SESSION_INVALID in the XenServerConnection and
>>>> transparently logs back in. This would prevent the higher level bits of the
>>>> XenServer plugin from having to deal with sessions being expired beneath
>>>> them.
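
For anyone who doesn't want to read the patch, the idea of catching SESSION_INVALID and transparently logging back in can be sketched roughly as below. This is only an illustration, not the code in the commit above: SessionInvalidException, Relogin, withRelogin and the maxRetries parameter are hypothetical names invented for this sketch and are not part of the XenAPI bindings or of CloudStack.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for the SESSION_INVALID failure reported by XenAPI.
class SessionInvalidException extends Exception {
    SessionInvalidException(String message) {
        super(message);
    }
}

// Hypothetical minimal view of a connection that is able to re-authenticate.
interface Relogin {
    void login() throws Exception;
}

final class ReloginSketch {
    private ReloginSketch() {}

    // Runs the call; if the session turns out to be invalid, logs in again
    // and retries. Gives up and rethrows after maxRetries re-logins.
    static <T> T withRelogin(Relogin connection, Callable<T> call, int maxRetries)
            throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return call.call();
            } catch (SessionInvalidException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                attempt++;
                connection.login();
            }
        }
    }
}
```

The retry bound is there so that a session which is invalidated repeatedly surfaces as an error instead of looping forever. Any other exception passes straight through to the caller.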
>>>>
>>>> Chers,
>>>> Dave
>>>>
>>>>>
>>>>> I suspect this now leaks Connections/sessions, but the symptom goes away.
>>>>>
>>>>> So far my thoughts are:
>>>>>
>>>>> 1. we need to find who's calling session.logout and why -- this will
>>>>> help fix the problem in the short term
>>>>>
>>>>> 2. The XenServer XenAPI bindings are harder to use than they should be
>>>>> (IMHO). In particular I think the bindings should take care of handling
>>>>> SESSION_INVALID exceptions and re-authenticating transparently, to avoid
>>>>> polluting the cloudstack code with rarely-used exception handlers.
>>>>>
>>>>> 3. the semantics of XenAPI task.destroy could be improved: instead of
>>>>> immediately removing the task (which then causes cleanup code to fail
>>>>> randomly it seems), it should be more like Unix waitpid with WNOHANG i.e.
>>>>> set a bit which says, "I'm done with this. Destroy it when you are finished
>>>>> with it."
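
To make point 3 concrete, here is a rough sketch of the "destroy it when you are finished with it" semantics. It is purely illustrative and is not how xapi implements tasks: the DeferredDestroyTask class and all of its method names are hypothetical.

```java
// Hypothetical task record showing deferred destruction; not the xapi Task.
final class DeferredDestroyTask {
    private boolean finished = false;
    private boolean destroyRequested = false;
    private boolean destroyed = false;
    private double progress = 0.0;

    // Client side: "I'm done with this. Destroy it when you are finished with it."
    // The task is only removed straight away if the operation has already finished.
    synchronized void requestDestroy() {
        destroyRequested = true;
        if (finished) {
            destroyed = true;
        }
    }

    // Server side: progress updates keep working after requestDestroy(),
    // so a running operation is not broken by the client's cleanup.
    synchronized void setProgress(double value) {
        if (destroyed) {
            throw new IllegalStateException("task already destroyed");
        }
        progress = value;
    }

    // Server side: called when the operation completes; this is the point
    // where a pending destroy request is honoured.
    synchronized void finish() {
        finished = true;
        if (destroyRequested) {
            destroyed = true;
        }
    }

    synchronized boolean isDestroyed() {
        return destroyed;
    }

    synchronized double getProgress() {
        return progress;
    }
}
```

With this shape, a client calling requestDestroy() in the middle of an operation would no longer cause the later progress update to fail, which is the failure described above for Async.VM.start.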
>>>>>
>>>>>
>>>>>>
>>>>>> Also, if I try to kick off a user VM to local storage, I get the
>>>>>> general-purpose InsufficientCapacityException and the virtual router does
>>>>>> not even start up.
>>>>>
>>>>> No idea about this one :)
>>>>>
>>>>> Cheers,
>>>>> Dave
>>>>>
>>>>>>
>>>>>> Can anyone create a similar cloud to what I've described here with XS 6.2,
>>>>>> XS62ESP1, and XS62ESP1004? I re-ran this test using a XS 6.1 host and it
>>>>>> works just fine.
>>>>>>
>>>>>> At the moment, this is blocking a test case I'm trying to execute to verify
>>>>>> code I had to write in Xenserver625StorageProcessor.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> --
>>>>>> *Mike Tutkowski*
>>>>>> *Senior CloudStack Developer, SolidFire Inc.*
>>>>>> e: mike.tutkowski@solidfire.com
>>>>>> o: 303.746.7302
>>>>>> Advancing the way the world uses the
>>>>>> cloud<http://solidfire.com/solution/overview/?video=play>
>>>>>> *(tm)*
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>


_______________________________________________
Xen-api mailing list
Xen-api@lists.xen.org
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api
Re: [ACS4.4, XenServer] Problem starting system VMs
On 01/05/14 11:35, Dave Scott wrote:
> Hi,
>
> I think I’ve tracked this down. I believe it’s a bug in the XenServer’s event mechanism, specifically a bug where some shared state causes parallel calls to event.from to interfere with each other. From CloudStack’s point of view this manifests as
>
> * spurious SESSION_INVALID exceptions in waitForTask, which triggers cleanup (Task.destroy), which prevents the VM.start from completing, leaving the VM paused
> * empty lists of events being returned in non-timeout cases
>
> I’ve prototyped a fix together with a test case (which fails before and passes after) and made a pull request containing both:
>
> https://github.com/xapi-project/xen-api/pull/1719

Pull request looks very nice. Your second bullet point was due to the
fact that the autogenerated code couldn't cope with the immutable
database being passed in, so we're generating the snapshots from the
live db. I believe this has now changed and we can associate a database
snapshot with a context, so we could make that problem go away
completely rather than looping until the problem doesn't happen :-)

I think the snapshots fix is a nice-to-have though, so if you could make
a PR for master rather than the clearwater branch, I'll merge.

Jon


> I’d appreciate review from xapi experts, particularly Jon Ludlam (cc:d). I’ve also cc:d the main xapi development list.
>
> Cheers,
> Dave
>

