Re: [Pacemaker] migration fix for ocf:heartbeat:Xen
Hi,

On Thu, Aug 11, 2011 at 09:07:00PM +0000, Daugherity, Andrew W wrote:
> I have discovered that sometimes when migrating a VM, the migration itself will succeed, but the migrate_from call on the target node will fail, as apparently the status hasn't settled down yet. This is more likely to happen when stopping pacemaker on a node, causing all its VMs to migrate away. Migration succeeds, but then (sometimes) the status call in migrate_from fails, and the VM is unnecessarily stopped and started. Note that it is NOT a timeout problem, as the migrate_from operation (which only checks status) takes less than a second.
>
> I noticed the VirtualDomain RA does a loop rather than just checking the status once as the Xen RA does, so I patched a similar thing into the Xen RA, and that solved my problem. I don't know what the appropriate time to wait is, so I copied the "2/3 of the timeout" logic from the stop function. Perhaps the full timeout would be more appropriate?

I guess so too. The patch below looks OK to me, since migrate_from
shouldn't take much time, but I don't have enough experience with
migrations. Could somebody please take a look as well? Moving this
discussion to linux-ha-dev.

Many thanks for the patch!

Dejan

> Patch:
> --- Xen.orig 2011-08-11 12:54:13.574795799 -0500
> +++ Xen 2011-08-11 15:18:58.352851276 -0500
> @@ -384,6 +384,20 @@
>  }
>
>  Xen_Migrate_From() {
> +    if [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
> +        # Allow 2/3 of the action timeout for status to stabilize
> +        # (The original unit is ms, hence the conversion)
> +        timeout=$((OCF_RESKEY_CRM_meta_timeout/1500))
> +    else
> +        timeout=10 # should be plenty
> +    fi
> +
> +    while ! Xen_Status ${DOMAIN_NAME} && [ $timeout -gt 0 ]; do
> +        ocf_log debug "$DOMAIN_NAME: Not yet active locally, waiting (timeout: ${timeout}s)"
> +        timeout=$((timeout-1))
> +        sleep 1
> +    done
> +
>      if Xen_Status ${DOMAIN_NAME}; then
>          Xen_Adjust_Memory 0
>          ocf_log info "$DOMAIN_NAME: Active locally, migration successful"
> === end patch ===
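
For reference, the arithmetic above works out like this: the CRM hands the
action timeout to the RA in milliseconds, and dividing by 1500 yields two
thirds of it in seconds. A standalone sketch of the same conversion (the
30000 ms value is an assumption here, matching the CRM_meta_timeout seen
in the patched-run log further down):

    # Sketch only: the same conversion the patch uses.
    # x ms -> seconds is x/1000; two thirds of that is x/1500.
    OCF_RESKEY_CRM_meta_timeout=30000  # assumed: the value from the log below
    timeout=$((OCF_RESKEY_CRM_meta_timeout/1500))
    echo "${timeout}s"                 # prints "20s", matching the debug log

If the full timeout were preferred, as suggested above, the divisor would
simply become 1000.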
>
>
> Example resource definition:
> primitive vm-build11 ocf:heartbeat:Xen \
>     params xmfile="/mnt/xen/vm/build11" \
>     meta allow-migrate="true" target-role="Started" is-managed="true" \
>     utilization cores="1" mem="512" \
>     op start interval="0" timeout="60" \
>     op stop interval="0" timeout="60" \
>     op migrate_to interval="0" timeout="180" \
>     op monitor interval="30" timeout="30" start-delay="60"
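
Note that no migrate_from op is declared above, so that action presumably
falls back to the cluster's default action timeout (the patched-run log
further down shows CRM_meta_timeout=[30000] for migrate_from, which the
2/3 conversion turns into the 20 s wait budget). If the wait loop ever
needs more headroom, an explicit op line in the same style should do it;
a hypothetical example, slotted in alongside the other op lines:

    op migrate_from interval="0" timeout="60" \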
>
> Debug log excerpts (grep build11):
>
> source node (app-02)
> ====
> Aug 11 12:08:33 app-02 lrmd: [9952]: debug: on_msg_perform_op: add an operation operation migrate_to[50] on ocf::Xen::vm-build11 for client 9955, its parameters: CRM_meta_migrate_source=[app-02] CRM_meta_migrate_target=[app-04] CRM_meta_record_pending=[false] CRM_meta_name=[migrate_to] CRM_meta_timeout=[180000] crm_feature_set=[3.0.5] xmfile=[/mnt/xen/vm/build11] to the operation list.
> Aug 11 12:08:33 app-02 lrmd: [9952]: info: rsc:vm-build11:50: migrate_to
> Aug 11 12:08:33 app-02 crmd: [9955]: info: process_lrm_event: LRM operation vm-build11_monitor_30000 (call=49, status=1, cib-update=0, confirmed=true) Cancelled
> Aug 11 12:08:33 app-02 Xen[29909]: [29984]: INFO: build11: Starting xm migrate to app-04
> Aug 11 12:08:43 app-02 Xen[29909]: [30228]: INFO: build11: xm migrate to app-04 succeeded.
> Aug 11 12:08:43 app-02 lrmd: [9952]: info: Managed vm-build11:migrate_to process 29909 exited with return code 0.
> Aug 11 12:08:43 app-02 crmd: [9955]: debug: create_operation_update: do_update_resource: Updating resouce vm-build11 after complete migrate_to op (interval=0)
> Aug 11 12:08:43 app-02 crmd: [9955]: info: process_lrm_event: LRM operation vm-build11_migrate_to_0 (call=50, rc=0, cib-update=75, confirmed=true) ok
> Aug 11 12:08:43 app-02 crmd: [9955]: info: do_lrm_rsc_op: Performing key=141:69:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_stop_0 )
> Aug 11 12:08:43 app-02 lrmd: [9952]: debug: on_msg_perform_op: add an operation operation stop[52] on ocf::Xen::vm-build11 for client 9955, its parameters: crm_feature_set=[3.0.5] to the operation list.
> Aug 11 12:08:43 app-02 lrmd: [9952]: info: rsc:vm-build11:52: stop
> Aug 11 12:08:44 app-02 Xen[30381]: [30469]: INFO: Xen domain build11 already stopped.
> Aug 11 12:08:44 app-02 lrmd: [9952]: info: RA output: (vm-build11:stop:stderr) Error:
> Aug 11 12:08:44 app-02 lrmd: [9952]: info: RA output: (vm-build11:stop:stderr)
> Aug 11 12:08:44 app-02 lrmd: [9952]: info: RA output: (vm-build11:stop:stderr) Domain 'build11' does not exist.
> Aug 11 12:08:44 app-02 lrmd: [9952]: info: RA output: (vm-build11:stop:stderr)
> Aug 11 12:08:44 app-02 lrmd: [9952]: info: Managed vm-build11:stop process 30381 exited with return code 0.
> Aug 11 12:08:44 app-02 crmd: [9955]: debug: create_operation_update: do_update_resource: Updating resouce vm-build11 after complete stop op (interval=0)
> Aug 11 12:08:44 app-02 crmd: [9955]: info: process_lrm_event: LRM operation vm-build11_stop_0 (call=52, rc=0, cib-update=78, confirmed=true) ok
> ====
> (I'm guessing the "already stopped" after migrate_to succeeds is not a problem.)
>
>
> target node (app-04):
> ====
> Aug 11 12:08:43 app-04 crmd: [9943]: info: do_lrm_rsc_op: Performing key=200:68:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_migrate_from_0 )
> Aug 11 12:08:43 app-04 lrmd: [9940]: info: rsc:vm-build11:88: migrate_from
> Aug 11 12:08:43 app-04 Xen[1574]: [1619]: ERROR: build11: Not active locally, migration failed!
> Aug 11 12:08:43 app-04 crmd: [9943]: info: process_lrm_event: LRM operation vm-build11_migrate_from_0 (call=88, rc=1, cib-update=144, confirmed=true) unknown error
> Aug 11 12:08:43 app-04 crmd: [9943]: info: do_lrm_rsc_op: Performing key=16:69:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_stop_0 )
> Aug 11 12:08:43 app-04 lrmd: [9940]: info: rsc:vm-build11:89: stop
> Aug 11 12:08:44 app-04 Xen[1744]: [1852]: INFO: Xen domain build11 will be stopped (timeout: 20s)
> Aug 11 12:08:51 app-04 Xen[1744]: [2146]: INFO: Xen domain build11 stopped.
> Aug 11 12:08:51 app-04 crmd: [9943]: info: process_lrm_event: LRM operation vm-build11_stop_0 (call=89, rc=0, cib-update=145, confirmed=true) ok
> Aug 11 12:09:06 app-04 crmd: [9943]: info: do_lrm_rsc_op: Performing key=142:69:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_start_0 )
> Aug 11 12:09:06 app-04 lrmd: [9940]: info: rsc:vm-build11:91: start
> Aug 11 12:09:10 app-04 lrmd: [9940]: info: RA output: (vm-build11:start:stdout) Using config file "/mnt/xen/vm/build11".#012Started domain build11 (id=18)
> Aug 11 12:09:11 app-04 crmd: [9943]: info: process_lrm_event: LRM operation vm-build11_start_0 (call=91, rc=0, cib-update=147, confirmed=true) ok
> Aug 11 12:09:11 app-04 crmd: [9943]: info: do_lrm_rsc_op: Performing key=143:69:0:478c5322-daa4-462a-a333-6d26288fa416 op=vm-build11_monitor_30000 )
> Aug 11 12:10:11 app-04 lrmd: [9940]: info: rsc:vm-build11:93: monitor
> Aug 11 12:10:11 app-04 crmd: [9943]: info: process_lrm_event: LRM operation vm-build11_monitor_30000 (call=93, rc=0, cib-update=149, confirmed=false) ok
> ====
> Note the "Not active locally, migration failed!" error and subsequent stop/start.
>
> Aside: why is the stop timeout 20 sec? Looking at the Xen RA man page and source, it seems like it should use the shutdown_timeout parameter, or if that is not set, 2/3 of the stop timeout. I don't have shutdown_timeout set for this resource, and the stop timeout is 60, but somehow it's getting 20 instead of 40.
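
One explanation consistent with the numbers: 2/3 of a 30 s action timeout
is exactly 20 s, so the stop may have run against the cluster's default
action timeout rather than the configured 60 s (the source-node stop op
logged above lists only crm_feature_set in its parameters, with no
CRM_meta_timeout, which points the same way). A rough sketch of the
timeout selection as described, not the verbatim RA code:

    # Rough sketch of the stop-timeout selection (approximate, not verbatim):
    if [ -n "$OCF_RESKEY_shutdown_timeout" ]; then
        timeout=$OCF_RESKEY_shutdown_timeout            # explicit parameter wins
    elif [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
        timeout=$((OCF_RESKEY_CRM_meta_timeout/1500))   # 2/3 of the op timeout, ms -> s
    else
        timeout=60                                      # assumed fallback; actual default may differ
    fi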
>
>
> With my patch in place, everything is fine after a couple seconds:
> (app-02 is the migration target in this case)
> ====
> Aug 11 15:50:08 app-02 lrmd: [22253]: debug: on_msg_perform_op: add an operation operation migrate_from[40] on ocf::Xen::vm-build11 for client 22256, its parameters: CRM_meta_migrate_source=[app-04] CRM_meta_migrate_target=[app-02] CRM_meta_record_pending=[false] CRM_meta_timeout=[30000] crm_feature_set=[3.0.5] xmfile=[/mnt/xen/vm/build11] to the operation list.
> Aug 11 15:50:08 app-02 Xen[24984]: [25055]: DEBUG: build11: Not yet active locally, waiting (timeout: 20s)
> Aug 11 15:50:09 app-02 Xen[24984]: [25163]: INFO: build11: Active locally, migration successful
> ====
>
> Incidentally, on a test where I stopped pacemaker on a node running 5 VMs, only one of them required this wait; the other four immediately returned "Active locally, migration successful". It's a race condition mitigated by the wait loop, I guess.
>
> Software is SLES 11 HAE SP1 + latest updates (Xen RA belongs to resource-agents-1.0.3-0.10.1).
>
>
> Thanks,
>
> Andrew Daugherity
> Systems Analyst
> Division of Research, Texas A&M University
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] migration fix for ocf:heartbeat:Xen
On Fri, Sep 23, 2011 at 11:55:40AM +0200, Dejan Muhamedagic wrote:
> [...]
> Many thanks for the patch!

Patch applied:

commit 022fd8075bc9ae58d65265cdc835bb8f7a4ef4be
Author: Daugherity, Andrew W <adaugherity@tamu.edu>
Date:   Wed Sep 28 17:17:29 2011 +0200

    Medium: Xen: wait in migrate_from for the migration to
    finish instead of bailing out immediately

    Signed-off-by: Dejan Muhamedagic <dejan@suse.de>


Thanks,

Dejan

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/