Mailing List Archive

Problem with migration, priority, stickiness
Cluster s/w specs:
Kernel: 2.6.32-431.17.1.el6.x86_64
OS: CentOS 6.5
corosync-1.4.1-17.el6_5.1.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64
crmsh-2.0+git46-1.1.x86_64


Attached to this email are two text files: one contains the output of 'crm
configure show' (addresses sanitized) and the other contains the output of
'crm_simulate -sL'.

Here is the situation; we've encountered it multiple times now and
I've been unable to solve it:

* A machine in the cluster fails
* There is a spare node, unused, in the cluster available for
assignment
* The resource group that was on the failed machine, instead of
being put onto the spare, unused node, is placed on a node where
another resource group is already running
* The displaced resource group is then launched on the spare,
unused node

As an example, this morning the following occurred:

Resource Group NRTMASTER is running on system gpmhac01
Resource Group NRTPNODE1 is running on system gpmhac02
Resource Group NRTPNODE2 is running on system gpmhac05
Resource Group NRTPNODE3 is running on system gpmhac04
Resource Group NRTPNODE4 is running on system gpmhac03

system gpmhac06 is up, available, and unused

system gpmhac04 fails and powers off

Resource Group NRTPNODE3 is moved to system gpmhac05
Resource Group NRTPNODE2 is moved to system gpmhac06


One of the big things that seems to occur here is that while the group
NRTPNODE3 is being launched on gpmhac05, the group NRTPNODE2 is being shut
down simultaneously, which causes race conditions where one start
script is putting a state file in place while the stop script is erasing
it. This leaves the system in an unusable state because required files,
parameters, and settings are missing/corrupted.

Secondly, there is simply no reason to kill a perfectly healthy resource
group that is operating just fine in order to launch a resource group
whose machine has failed when:
1. There's a spare node available
2. The resource groups have equal priority with each other, i.e.
all of the NRTPNODE# resource groups have priority "60"
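
For reference, that priority is set as a meta attribute on each group. In
'crm configure show' terms it looks roughly like the sketch below (the
member resource names here are placeholders, not our real primitives):

    # placeholder member resources -- our real primitives differ
    group NRTPNODE3 fs_pnode3 ip_pnode3 app_pnode3 \
        meta priority="60"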


So I really need some help here in getting this set up so that it behaves
the way we *think* it should, based on what we understand of the
Pacemaker architecture. Obviously we're missing something, since this
resource group "shuffling" occurs when there's a failed system despite
an unused, spare node being available for immediate use, and it has bitten
us several times. The fact that the race condition between startup and
shutdown also leaves the system that is brought up useless exacerbates
the situation immensely.

Ideally, this is what we want:

1. If a system fails, the resources/resource group running on it
are moved to an unused, available system. No other resource
shuffling occurs amongst the remaining systems.

2. If a system fails and there is not an unused, available
system to fail over to, then IF the resource group has a higher
priority than another resource group, the group with the lower
priority is shut down. Only when that shutdown is complete will
the resource group with the higher priority begin starting its
resources.

3. If a system fails and there is not an unused, available
system to fail over to, then IF the resource group has the same
or lower priority than all other resource groups, it will not
attempt to launch itself on any other node, nor cause any other
resource group to stop or migrate.

4. Unless specifically and manually ordered to move, or unless its
hardware system fails, a resource group should remain on its
current hardware system. It should never be forced to migrate to
a new system because something of equal or lower priority failed
and migrated to a new system.

5. We do not need resource groups to fail back to their original nodes;
once running, we want them to stay on their current system
until/unless a hardware failure forces them off the system, or we
manually tell them to move.


Can someone please look over our configuration, and the bizarre scores
that I see in the crm_simulate output, and help me get to the point
where I can achieve an HA cluster that doesn't kill healthy resources in
some kind of game of musical chairs when there's an empty chair available?
Can you also tell me why, or help me ensure, that a startup doesn't occur
until AFTER a shutdown is completely done?
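
(If it matters: we've been wondering whether an explicit ordering
constraint is what's called for here. Something along the lines of the
crmsh sketch below, with the resources and direction purely illustrative;
we don't know if this is the right tool:)

    # illustrative only: require NRTPNODE2 to finish stopping before
    # NRTPNODE3 is started, without implying the reverse ordering
    crm configure order stop-pnode2-before-start-pnode3 \
        inf: NRTPNODE2:stop NRTPNODE3:start symmetrical=false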


Obviously we're misunderstanding or misapplying something with our
resource stickiness, our resource group priority, or something else, and
we really need to get this resolved. My job is literally on the line here
due to these failures to operate in the fashion we expect, so all help is
appreciated.


Thanks
Tony

Re: Problem with migration, priority, stickiness
----- Original Message -----
> From: "Tony Stocker" <tony.stocker@nasa.gov>
> To: "Linux HA Cluster Development List" <linux-ha@lists.linux-ha.org>
> Sent: Tuesday, May 20, 2014 8:18:52 AM
> Subject: [Linux-HA] Problem with migration, priority, stickiness
>

What happens if you set resource-stickiness higher? Say to something like 6000; perhaps that would combat the location constraint scores you have going on.
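
For example, something like this (assuming you want it as a cluster-wide
default via rsc_defaults rather than per resource; the value is just a
starting point to experiment with):

    # raise the default stickiness for all resources
    crm configure rsc_defaults resource-stickiness=6000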

-- Vossel


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems