Mailing List Archive: HA Compute & Instance Evacuation

HA Compute & Instance Evacuation

May 2, 2018, 11:43 AM

Post #1 of 14 (2639 views)

I am working on setting up Openstack for HA and one of the last orders of
business is getting HA behavior out of the compute nodes. Is there a project
that will automatically evacuate instances from a downed or failed compute
host, and automatically reboot them on their new host? I'm curious what
suggestions people have about this, or whatever advice you might have. Is
there a best way of getting this functionality, or anything else I should be
aware of?

Thanks,

_______________________________________________
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack

Re: HA Compute & Instance Evacuation [ In reply to ]

jaypipes at gmail

May 2, 2018, 12:42 PM

Post #2 of 14 (2638 views)

On 05/02/2018 02:43 PM, Torin Woltjer wrote:
> I am working on setting up Openstack for HA and one of the last orders of
> business is getting HA behavior out of the compute nodes.

There is no HA behaviour for compute nodes.

> Is there a project that will automatically evacuate instances from a
> downed or failed compute host, and automatically reboot them on their
> new host?
Check out Masakari:

https://wiki.openstack.org/wiki/Masakari

> I'm curious what suggestions people have about this, or whatever
> advice you might have. Is there a best way of getting this
> functionality, or anything else I should be aware of?

You are referring to HA of workloads running on compute nodes, not HA of
compute nodes themselves.

My advice would be to install Kubernetes on one or more VMs (with the
VMs acting as Kubernetes nodes) and use that project's excellent
orchestrator for daemonsets/statefulsets which is essentially the use
case you are describing.

The OpenStack Compute API (implemented in Nova) is not an orchestration
API. It's a low-level infrastructure API for executing basic actions on
compute resources.

Best,
-jay

Re: HA Compute & Instance Evacuation [ In reply to ]

jpetrini at coredial

May 2, 2018, 1:37 PM

Post #3 of 14 (2638 views)

We're using the original Masakari project for this and it works really
well. In fact, just last week we lost a compute node and all of its VMs were
successfully migrated to a reserve host in under 5 minutes. It's a really
nice feeling when your infrastructure heals itself before you even get a
chance to start troubleshooting.

It does require a good deal of configuration to get it up and running,
especially the clustering with Pacemaker/Corosync, so be prepared to get
familiar with those tools and STONITH if you're not already. It's worth it if
some of your infrastructure doesn't have redundancy built in at a higher
level.

Re: HA Compute & Instance Evacuation [ In reply to ]

torin.woltjer at granddial

May 2, 2018, 1:39 PM

Post #4 of 14 (2638 views)

> There is no HA behaviour for compute nodes.
>
> You are referring to HA of workloads running on compute nodes, not HA of
> compute nodes themselves.
It was a mistake for me to say HA when referring to compute and instances. Really I want to avoid a situation where one of my compute hosts gives up the ghost, and all of the instances are offline until someone reboots them on a different host. I would like them to automatically reboot on a healthy compute node.

> Check out Masakari:
>
> https://wiki.openstack.org/wiki/Masakari
This looks like the kind of thing I'm searching for.

I'm seeing 3 components here, I'm assuming one goes on compute hosts and one or both of the others go on the control nodes? Is there any documentation outlining the procedure for deploying this? Will there be any problem running the Masakari API service on 2 machines simultaneously, sitting behind HAProxy?

Re: [masakari] HA Compute & Instance Evacuation [ In reply to ]

jaypipes at gmail

May 2, 2018, 1:46 PM

Post #5 of 14 (2638 views)

On 05/02/2018 04:39 PM, Torin Woltjer wrote:
> > There is no HA behaviour for compute nodes.
> >
> > You are referring to HA of workloads running on compute nodes, not HA of
> > compute nodes themselves.
> It was a mistake for me to say HA when referring to compute and
> instances. Really I want to avoid a situation where one of my compute
> hosts gives up the ghost, and all of the instances are offline until
> someone reboots them on a different host. I would like them to
> automatically reboot on a healthy compute node.
>
> > Check out Masakari:
> >
> > https://wiki.openstack.org/wiki/Masakari
> This looks like the kind of thing I'm searching for.
>
> I'm seeing 3 components here, I'm assuming one goes on compute hosts and
> one or both of the others go on the control nodes?

I don't believe anything goes on the compute nodes, no. I'm pretty sure
the Masakari API service and engine workers live on controller nodes.

> Is there any documentation outlining the procedure for deploying
> this? Will there be any problem running the Masakari API service on 2
> machines simultaneously, sitting behind HAProxy?
Not sure. I'll leave it up to the Masakari developers to help out here.
I've added [masakari] topic to the subject line.

Best,
-jay

Re: HA Compute & Instance Evacuation [ In reply to ]

torin.woltjer at granddial

May 2, 2018, 2:24 PM

Post #6 of 14 (2632 views)

I'm vaguely familiar with Pacemaker/Corosync, as I'm using it with HAProxy on my controller nodes. I'm assuming in this instance that you use Pacemaker on your compute hosts so masakari can detect host outages? If possible could you go into more detail about the configuration? I would like to use Masakari and I'm having trouble finding a step by step or other documentation to get started with.

Re: HA Compute & Instance Evacuation [ In reply to ]

jpetrini at coredial

May 2, 2018, 5:21 PM

Post #7 of 14 (2632 views)

Take this with a grain of salt because we're using the original version
before the project moved under the Big Tent and I'm not sure how much it's
evolved since then. I assume the basic functions are the same though.

You're correct; Corosync and Pacemaker are used to determine if a compute
node goes down. The masakari-host-monitor process runs on each compute node
and checks the cluster status and sends a notification to
masakari-controller when a node goes down. The controller process keeps a
list of reserved hosts in its database and calls nova host-evacuate to
move the instances to one of the reserved hosts.
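
As a rough illustration of that flow, here is a toy model in Python. It is not Masakari code: `detect_down_hosts`, `Controller`, and the host names are all hypothetical, and where the real controller would call nova to evacuate, this sketch only records a plan.

```python
from dataclasses import dataclass, field


def detect_down_hosts(previous, current):
    """Return hosts that were online in `previous` but are not online in `current`.

    Both arguments map host name -> bool (True means the cluster sees it online).
    """
    return sorted(
        host for host, was_online in previous.items()
        if was_online and not current.get(host, False)
    )


@dataclass
class Controller:
    """Hypothetical stand-in for the controller process."""
    reserved_hosts: list
    plans: list = field(default_factory=list)

    def notify_host_down(self, failed_host):
        # Take the first reserved host; a real controller would now ask
        # nova to evacuate the failed host's instances onto it.
        if not self.reserved_hosts:
            raise RuntimeError("no reserved host left for " + failed_host)
        target = self.reserved_hosts.pop(0)
        plan = (failed_host, target)
        self.plans.append(plan)
        return plan


if __name__ == "__main__":
    before = {"compute1": True, "compute2": True}
    after = {"compute1": True, "compute2": False}
    controller = Controller(reserved_hosts=["reserve1"])
    for host in detect_down_hosts(before, after):
        print(controller.notify_host_down(host))
```

The monitor side only compares two membership snapshots, which is roughly what checking the cluster status amounts to; all of the decision making stays in the controller.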

In our environment I also configured STONITH and I'd highly recommend it.
With STONITH, Pacemaker sends a shutdown command to the out-of-band
management card of the unreachable node to make sure that it can't come
back and cause a conflict.

There are two other components, masakari-process-monitor and
masakari-instance-monitor. These also run on your compute nodes. The former
watches the nova-compute service and the latter monitors running instances
and restarts them if necessary.

Looking here, it seems they've split Masakari into three different repos:
https://github.com/openstack?utf8=%E2%9C%93&q=masakari&type=&language=

masakari - The controller service and API
masakari-monitors - Compute node monitoring services
python-masakari-client - The CLI tools

Re: HA Compute & Instance Evacuation [ In reply to ]

torin.woltjer at granddial

May 4, 2018, 11:43 AM

Post #8 of 14 (2628 views)

Thank you very much for the information. Just for clarification, when you say reserved hosts, do you mean that I must keep unloaded virtualization hosts in reserve? Or can Masakari move instances from a downed host to an already loaded host that has open capacity?

Re: HA Compute & Instance Evacuation [ In reply to ]

Tushar.Patil at nttdata

May 6, 2018, 7:41 PM

Post #9 of 14 (2618 views)

Hi Torin,

Masakari supports 4 different recovery methods, chosen when a failover_segment is created.

1. auto: Nova decides which compute host the instances are evacuated to.

2. reserved_host: You first need to add reserved hosts to the failover segment. The Masakari engine selects the first available reserved host from the failover segment, enables its compute service in nova, and then evacuates the instances from the failed compute host to that reserved host.

3. auto_priority: It first tries to evacuate instances using the "auto" recovery method; if that fails, it attempts to evacuate using the "reserved_host" recovery method.

4. rh_priority: The opposite of "auto_priority". It first tries to evacuate instances using the "reserved_host" recovery method; if that fails, it attempts to evacuate using the "auto" recovery method.

In your case you will need to use the "auto" recovery method.
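
The fallback order of the four methods can be sketched in a few lines of Python. This is only a simulation of the ordering described above, not Masakari's engine code; `plan_recovery` and its arguments are hypothetical, and the success of each attempt is passed in rather than determined by nova.

```python
def plan_recovery(method, auto_ok, reserved_hosts):
    """Return which strategy would recover the instances, or None if all fail.

    auto_ok: whether a nova-scheduled evacuation would succeed (simulated).
    reserved_hosts: reserved hosts available in the failover segment (simulated).
    """
    def try_auto():
        return "auto" if auto_ok else None

    def try_reserved():
        # Use the first available reserved host, as the engine is described to do.
        return ("reserved_host:" + reserved_hosts[0]) if reserved_hosts else None

    order = {
        "auto": [try_auto],
        "reserved_host": [try_reserved],
        "auto_priority": [try_auto, try_reserved],
        "rh_priority": [try_reserved, try_auto],
    }
    if method not in order:
        raise ValueError("unknown recovery method: " + method)
    for attempt in order[method]:
        result = attempt()
        if result is not None:
            return result
    return None


if __name__ == "__main__":
    # auto fails, so auto_priority falls back to the reserved host.
    print(plan_recovery("auto_priority", auto_ok=False, reserved_hosts=["reserve1"]))
```

With "auto", no host has to sit idle in reserve; the evacuation simply fails if nova cannot find capacity on the remaining hosts.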

Please refer to the below documentation links for more details.

Masakari system architecture:
https://docs.openstack.org/masakari/latest/

Masakari api-ref:
https://developer.openstack.org/api-ref/instance-ha/

To install masakari-monitors with pacemaker/corosync:
https://review.openstack.org/#/c/489095/6/doc/source/install_and_configure_debian.rst

Other ways to reach us: the Masakari weekly meeting is held in the #openstack-meeting IRC channel every Tuesday at 0400 UTC, or you can post your queries in the #openstack-masakari IRC channel.

Regards,
Tushar

Re: HA Compute & Instance Evacuation [ In reply to ]

Pablo.Iranzo at redhat

May 7, 2018, 12:06 AM

Post #10 of 14 (2618 views)

+++ Torin Woltjer [02/05/18 20:39 +0000]:
>> There is no HA behaviour for compute nodes.
>>
>> You are referring to HA of workloads running on compute nodes, not HA of
>> compute nodes themselves.
>It was a mistake for me to say HA when referring to compute and instances. Really I want to avoid a situation where one of my compute hosts gives up the ghost, and all of the instances are offline until someone reboots them on a different host. I would like them to automatically reboot on a healthy compute node.
>
>> Check out Masakari:
>>
>> https://wiki.openstack.org/wiki/Masakari
>This looks like the kind of thing I'm searching for.
>
>I'm seeing 3 components here, I'm assuming one goes on compute hosts and one or both of the others go on the control nodes? Is there any documentation outlining the procedure for deploying this? Will there be any problem running the Masakari API service on 2 machines simultaneously, sitting behind HAProxy?

Check for 'Instance HA':

https://blueprints.launchpad.net/tripleo/+spec/instance-ha

Which more or less came with:

https://github.com/beekhof/osp-ha-deploy/blob/master/pcmk/compute-managed.scenario
https://github.com/beekhof/osp-ha-deploy/blob/master/pcmk/controller-managed.scenario

Ansible scripts are at git://github.com/redhat-openstack/tripleo-quickstart-utils

And enabled via:

ansible-playbook /home/stack/ansible-instanceha/playbooks/overcloud-instance-ha.yml \
    -e release="RELEASE"

This of course requires a valid HA deployment setup on the controllers
(usually tripleO or OSP Director).

Regards,
Pablo

--

Pablo Iranzo Gómez (Pablo.Iranzo@redhat.com) GnuPG: 0x5BD8E1E4
Principal Software Maintenance Engineer - OpenStack iranzo @ IRC
RHC{A,SS,DS,VA,E,SA,SP,AOSP}, JBCAA #110-215-852 RHCA Level V

Blog: https://iranzo.github.io Citellus: https://citellus.org

Re: HA Compute & Instance Evacuation [ In reply to ]

torin.woltjer at granddial

May 10, 2018, 7:08 AM

Post #11 of 14 (2613 views)

Hi Tushar,

I followed the documentation to set up the masakari monitors, after I
installed the masakari API. None of the monitor services seem to work. I keep
getting an error: "AttributeError: 'module' object has no attribute 'URI'"
Here is the full output: http://paste.openstack.org/show/720761/
Are you aware of what causes the issue? Can you provide any example configs
for a working masakari setup?

Re: HA Compute & Instance Evacuation [ In reply to ]

Tushar.Patil at nttdata

May 10, 2018, 9:40 PM

Post #12 of 14 (2612 views)

Hi Torin,

Presently, masakari-monitors is completely broken. Extremely sorry for the inconvenience.

I think this is what is needed to make it work:
Install openstacksdk version 0.13.0.

Apply patch: https://review.openstack.org/#/c/546492/

In this patch, we need to bump the openstacksdk version from 0.11.2 to 0.13.0.
We will merge the above patch soon.
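
For anyone who wants to check whether an installed openstacksdk meets that minimum, a small comparison helper might look like the sketch below. The function names are hypothetical and it only handles plain dotted numeric versions such as the two mentioned here.

```python
def parse_version(text):
    """Turn a dotted version such as '0.13.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum="0.13.0"):
    """Return True if `installed` is at least `minimum`."""
    # Tuples compare element by element, so (0, 9, 0) < (0, 13, 0),
    # whereas the plain strings "0.9.0" and "0.13.0" would compare wrongly.
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    print(meets_minimum("0.11.2"))
    print(meets_minimum("0.13.0"))
```

On Python 3.8 or later, the installed version string can be read with `importlib.metadata.version("openstacksdk")` and passed to `meets_minimum`.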

Regards,
Tushar Patil

Re: HA Compute & Instance Evacuation [ In reply to ]

torin.woltjer at granddial

May 11, 2018, 7:46 AM

Post #13 of 14 (2612 views)

On Friday, May 11, 2018 12:40:58 AM EDT Patil, Tushar wrote:
> I think this is what is needed to make it work.
> Install openstacksdk version 0.13.0.
>
> Apply patch: https://review.openstack.org/#/c/546492/
>
> In this patch ,we need to bump openstacksdk version from 0.11.2 to 0.13.0.
> We will merge above patch soon.

Do you have a timetable on when the patch will be merged? If it is a
relatively small window of time, I would rather wait to use the patched
mainline code. Otherwise, I am willing to try to work with the patch.
Additionally, patching python is something that I am not familiar with. Is
there a good resource on doing this?

You have been a great help so far, thanks again.

Re: HA Compute & Instance Evacuation [ In reply to ]

Tushar.Patil at nttdata

May 14, 2018, 1:07 AM

Post #14 of 14 (2606 views)

Hi Torin,

>> Do you have a timetable on when the patch will be merged? If it is a relatively small window of time, I would rather wait to use
>> the patched mainline code.
You should be able to test masakari successfully, as the three patches below are already merged.

1. https://review.openstack.org/#/c/546492/15 - openstack/masakari-monitors (it doesn't use masakariclient any more)
2. https://review.openstack.org/#/c/567781/ - openstack/requirements (openstacksdk lower constraints updated to 0.13.0)
3. https://review.openstack.org/#/c/536653/ - openstack/masakari (change service-type from "ha" to "instance-ha")

If you are planning to install OpenStack using the latest devstack, it will install openstacksdk 0.13.0 by default and you need to take no further action. Otherwise, you need to ensure that you have the correct version of openstacksdk (0.13.0) and also add the masakari endpoint with the correct service-type. We recommend installing the latest masakari using devstack.

4. https://review.openstack.org/#/c/557634/2 - python-masakariclient (This patch needs to be merged ASAP)
If you are planning to use python-masakariclient to create failover segments, add hosts, etc., then you will need to wait until this patch is merged. We need to update this patch to add the correct version of openstacksdk in requirements.txt. We will merge this particular patch by tomorrow. But if you plan to add failover segments/hosts by calling the RESTful API using curl or any other method, then you probably won't face any issues.
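
As a sketch of the direct REST route, the request for creating a failover segment could be built like this in Python. The body layout (a "segment" object with name, recovery_method and service_type, posted to /segments) is an assumption based on the api-ref linked earlier in the thread and should be checked against it; `build_segment_request`, the base URL, and the token are hypothetical placeholders.

```python
import json
import urllib.request


def build_segment_request(base_url, token, name, recovery_method="auto"):
    """Build (but do not send) a POST request that creates a failover segment."""
    body = {
        "segment": {
            "name": name,
            "recovery_method": recovery_method,
            # Assumed value; check the api-ref for what your release accepts.
            "service_type": "COMPUTE",
        }
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/segments",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder endpoint and token; use the instance-ha endpoint from your
    # Keystone catalog and a real token.
    req = build_segment_request("http://controller.example/v1", "TOKEN", "segment1")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it; building and sending are kept separate here so the payload can be inspected first.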

Regards,
Tushar Patil
