Mailing List Archive

Re: Avoid one node from being a target for resources migration
----- Original Message -----
> Hello.
>
> I have a 3-node cluster managed by corosync+pacemaker+crm. Node1 and Node2 are
> a DRBD master-slave pair and also run a number of other services
> (postgresql, nginx, ...). Node3 is just a corosync node (for quorum); no
> DRBD/postgresql/... is installed on it, only corosync+pacemaker.
>
> But when I add resources to the cluster, some of them are somehow moved to
> node3 and then fail. Note that I have a "colocation" directive to place
> these resources on the DRBD master only and a "location" rule with -inf for
> node3, but this does not help. Why? How can I make pacemaker not run
> anything on node3?
>
> All the resources are added in a single transaction: "cat config.txt | crm -w
> -f- configure", where config.txt contains the directives and a "commit"
> statement at the end.
>
> Below are "crm status" (error messages) and "crm configure show" outputs.
>
>
> root@node3:~# crm status
> Current DC: node2 (1017525950) - partition with quorum
> 3 Nodes configured
> 6 Resources configured
> Online: [ node1 node2 node3 ]
> Master/Slave Set: ms_drbd [drbd]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Resource Group: server
> fs (ocf::heartbeat:Filesystem): Started node1
> postgresql (lsb:postgresql): Started node3 FAILED
> bind9 (lsb:bind9): Started node3 FAILED
> nginx (lsb:nginx): Started node3 (unmanaged) FAILED
> Failed actions:
> drbd_monitor_0 (node=node3, call=744, rc=5, status=complete,
> last-rc-change=Mon Jan 12 11:16:43 2015, queued=2ms, exec=0ms): not
> installed
> postgresql_monitor_0 (node=node3, call=753, rc=1, status=complete,
> last-rc-change=Mon Jan 12 11:16:43 2015, queued=8ms, exec=0ms): unknown
> error
> bind9_monitor_0 (node=node3, call=757, rc=1, status=complete,
> last-rc-change=Mon Jan 12 11:16:43 2015, queued=11ms, exec=0ms): unknown
> error
> nginx_stop_0 (node=node3, call=767, rc=5, status=complete, last-rc-change=Mon
> Jan 12 11:16:44 2015, queued=1ms, exec=0ms): not installed

Here's what is going on. Even when you say "never run this resource on node3",
pacemaker still probes for the resource on node3, just to verify that it isn't
already running.

The "monitor_0" failures you are seeing indicate that pacemaker was unable to
verify whether the resources are running on node3, because the packages those
resources depend on are not installed there. Given pacemaker's default
behavior, this is exactly what I'd expect.

You have two options.

1. Install the resource-related packages on node3 even though you never want
them to run there. This allows the resource agents to verify that each
resource is in fact inactive.

2. If you are using the current master branch of pacemaker, there's a new
location constraint option called 'resource-discovery=always|never|exclusive'.
If you add 'resource-discovery=never' to the location constraint that keeps
resources off node3, pacemaker will skip the 'monitor_0' probes on node3 as
well.
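
For illustration, an untested sketch of what that could look like, assuming a
crm shell recent enough to understand the option:

    location loc_server server resource-discovery=never \
        rule -inf: #uname eq node3

The underlying CIB attribute sits on the rsc_location element itself, so the
same constraint expressed as raw XML would be roughly:

    <rsc_location id="loc_server" rsc="server" resource-discovery="never">
      <rule id="loc_server-rule" score="-INFINITY">
        <expression id="loc_server-rule-expr" attribute="#uname"
                    operation="eq" value="node3"/>
      </rule>
    </rsc_location>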

-- Vossel

>
> root@node3:~# crm configure show | cat
> node $id="1017525950" node2
> node $id="13071578" node3
> node $id="1760315215" node1
> primitive drbd ocf:linbit:drbd \
> params drbd_resource="vlv" \
> op start interval="0" timeout="240" \
> op stop interval="0" timeout="120"
> primitive fs ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/var/lib/vlv.drbd/root"
> options="noatime,nodiratime" fstype="xfs" \
> op start interval="0" timeout="300" \
> op stop interval="0" timeout="300"
> primitive postgresql lsb:postgresql \
> op monitor interval="10" timeout="60" \
> op start interval="0" timeout="60" \
> op stop interval="0" timeout="60"
> primitive bind9 lsb:bind9 \
> op monitor interval="10" timeout="60" \
> op start interval="0" timeout="60" \
> op stop interval="0" timeout="60"
> primitive nginx lsb:nginx \
> op monitor interval="10" timeout="60" \
> op start interval="0" timeout="60" \
> op stop interval="0" timeout="60"
> group server fs postgresql bind9 nginx
> ms ms_drbd drbd meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> location loc_server server rule $id="loc_server-rule" -inf: #uname eq node3
> colocation col_server inf: server ms_drbd:Master
> order ord_server inf: ms_drbd:promote server:start
> property $id="cib-bootstrap-options" \
> stonith-enabled="false" \
> last-lrm-refresh="1421079189" \
> maintenance-mode="false"
>

Re: Avoid one node from being a target for resources migration
> On 13 Jan 2015, at 4:25 am, David Vossel <dvossel@redhat.com> wrote:
>
>
>
> ----- Original Message -----
>> Hello.
>>
>> I have a 3-node cluster managed by corosync+pacemaker+crm. Node1 and Node2 are
>> a DRBD master-slave pair and also run a number of other services
>> (postgresql, nginx, ...). Node3 is just a corosync node (for quorum); no
>> DRBD/postgresql/... is installed on it, only corosync+pacemaker.
>>
>> But when I add resources to the cluster, some of them are somehow moved to
>> node3 and then fail. Note that I have a "colocation" directive to place
>> these resources on the DRBD master only and a "location" rule with -inf for
>> node3, but this does not help. Why? How can I make pacemaker not run
>> anything on node3?
>>
>> All the resources are added in a single transaction: "cat config.txt | crm -w
>> -f- configure", where config.txt contains the directives and a "commit"
>> statement at the end.
>>
>> Below are "crm status" (error messages) and "crm configure show" outputs.
>>
>>
>> root@node3:~# crm status
>> Current DC: node2 (1017525950) - partition with quorum
>> 3 Nodes configured
>> 6 Resources configured
>> Online: [ node1 node2 node3 ]
>> Master/Slave Set: ms_drbd [drbd]
>> Masters: [ node1 ]
>> Slaves: [ node2 ]
>> Resource Group: server
>> fs (ocf::heartbeat:Filesystem): Started node1
>> postgresql (lsb:postgresql): Started node3 FAILED
>> bind9 (lsb:bind9): Started node3 FAILED
>> nginx (lsb:nginx): Started node3 (unmanaged) FAILED
>> Failed actions:
>> drbd_monitor_0 (node=node3, call=744, rc=5, status=complete,
>> last-rc-change=Mon Jan 12 11:16:43 2015, queued=2ms, exec=0ms): not
>> installed
>> postgresql_monitor_0 (node=node3, call=753, rc=1, status=complete,
>> last-rc-change=Mon Jan 12 11:16:43 2015, queued=8ms, exec=0ms): unknown
>> error
>> bind9_monitor_0 (node=node3, call=757, rc=1, status=complete,
>> last-rc-change=Mon Jan 12 11:16:43 2015, queued=11ms, exec=0ms): unknown
>> error
>> nginx_stop_0 (node=node3, call=767, rc=5, status=complete, last-rc-change=Mon
>> Jan 12 11:16:44 2015, queued=1ms, exec=0ms): not installed
>
> Here's what is going on. Even when you say "never run this resource on node3",
> pacemaker still probes for the resource on node3, just to verify that it isn't
> already running.
>
> The "monitor_0" failures you are seeing indicate that pacemaker was unable to
> verify whether the resources are running on node3, because the packages those
> resources depend on are not installed there. Given pacemaker's default
> behavior, this is exactly what I'd expect.
>
> You have two options.
>
> 1. Install the resource-related packages on node3 even though you never want
> them to run there. This allows the resource agents to verify that each
> resource is in fact inactive.

or 1b. delete the agent scripts from node3 as well, so the probes simply report "not installed"; recent versions of pacemaker should handle that case correctly.
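
A rough, untested sketch of what that could look like on node3, assuming the
failures come from leftover agent scripts (the paths are the usual
Debian/Ubuntu locations and are purely illustrative):

    # remove the leftover agents so the probes report "not installed" (rc=5)
    rm -f /etc/init.d/postgresql /etc/init.d/bind9 /etc/init.d/nginx
    # then clear the recorded probe failures
    crm resource cleanup server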

>
> 2. If you are using the current master branch of pacemaker, there's a new
> location constraint option called 'resource-discovery=always|never|exclusive'.
> If you add 'resource-discovery=never' to the location constraint that keeps
> resources off node3, pacemaker will skip the 'monitor_0' probes on node3 as
> well.
>
> -- Vossel
>
>>
>> root@node3:~# crm configure show | cat
>> node $id="1017525950" node2
>> node $id="13071578" node3
>> node $id="1760315215" node1
>> primitive drbd ocf:linbit:drbd \
>> params drbd_resource="vlv" \
>> op start interval="0" timeout="240" \
>> op stop interval="0" timeout="120"
>> primitive fs ocf:heartbeat:Filesystem \
>> params device="/dev/drbd0" directory="/var/lib/vlv.drbd/root"
>> options="noatime,nodiratime" fstype="xfs" \
>> op start interval="0" timeout="300" \
>> op stop interval="0" timeout="300"
>> primitive postgresql lsb:postgresql \
>> op monitor interval="10" timeout="60" \
>> op start interval="0" timeout="60" \
>> op stop interval="0" timeout="60"
>> primitive bind9 lsb:bind9 \
>> op monitor interval="10" timeout="60" \
>> op start interval="0" timeout="60" \
>> op stop interval="0" timeout="60"
>> primitive nginx lsb:nginx \
>> op monitor interval="10" timeout="60" \
>> op start interval="0" timeout="60" \
>> op stop interval="0" timeout="60"
>> group server fs postgresql bind9 nginx
>> ms ms_drbd drbd meta master-max="1" master-node-max="1" clone-max="2"
>> clone-node-max="1" notify="true"
>> location loc_server server rule $id="loc_server-rule" -inf: #uname eq node3
>> colocation col_server inf: server ms_drbd:Master
>> order ord_server inf: ms_drbd:promote server:start
>> property $id="cib-bootstrap-options" \
>> stonith-enabled="false" \
>> last-lrm-refresh="1421079189" \
>> maintenance-mode="false"
>>


Re: Avoid one node from being a target for resources migration
Dmitry Koterov <dmitry.koterov@gmail.com> wrote:
>Hello.
>
>I have a 3-node cluster managed by corosync+pacemaker+crm. Node1 and Node2
>are a DRBD master-slave pair and also run a number of other services
>(postgresql, nginx, ...). Node3 is just a corosync node (for quorum); no
>DRBD/postgresql/... is installed on it, only corosync+pacemaker.

A quorum node can work with corosync alone (no pacemaker at all). It won't show up in crm_mon, but it will still count toward quorum (at least with corosync 2).
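
For reference, a minimal sketch of the corosync 2.x quorum section such a
setup relies on (illustrative only; the totem and nodelist sections are
omitted), with pacemaker simply left stopped/disabled on node3:

    quorum {
        provider: corosync_votequorum
    }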

>But when I add resources to the cluster, some of them are somehow moved
>to node3 and then fail. Note that I have a "colocation" directive to
>place these resources on the DRBD master only and a "location" rule with
>-inf for node3, but this does not help. Why? How can I make pacemaker not
>run anything on node3?
>
>All the resources are added in a single transaction: "cat config.txt | crm
>-w -f- configure", where config.txt contains the directives and a "commit"
>statement at the end.
>
>Below are "crm status" (error messages) and "crm configure show"
>outputs.
>
>
>*root@node3:~# crm status*
>Current DC: node2 (1017525950) - partition with quorum
>3 Nodes configured
>6 Resources configured
>Online: [ node1 node2 node3 ]
>Master/Slave Set: ms_drbd [drbd]
> Masters: [ node1 ]
> Slaves: [ node2 ]
>Resource Group: server
> fs (ocf::heartbeat:Filesystem): Started node1
> postgresql (lsb:postgresql): Started node3 FAILED
> bind9 (lsb:bind9): Started node3 FAILED
> nginx (lsb:nginx): Started node3 (unmanaged) FAILED
>Failed actions:
> drbd_monitor_0 (node=node3, call=744, rc=5, status=complete,
>last-rc-change=Mon Jan 12 11:16:43 2015, queued=2ms, exec=0ms): not
>installed
> postgresql_monitor_0 (node=node3, call=753, rc=1, status=complete,
>last-rc-change=Mon Jan 12 11:16:43 2015, queued=8ms, exec=0ms): unknown
>error
> bind9_monitor_0 (node=node3, call=757, rc=1, status=complete,
>last-rc-change=Mon Jan 12 11:16:43 2015, queued=11ms, exec=0ms):
>unknown
>error
> nginx_stop_0 (node=node3, call=767, rc=5, status=complete,
>last-rc-change=Mon Jan 12 11:16:44 2015, queued=1ms, exec=0ms): not
>installed
>
>
>*root@node3:~# crm configure show | cat*
>node $id="1017525950" node2
>node $id="13071578" node3
>node $id="1760315215" node1
>primitive drbd ocf:linbit:drbd \
>params drbd_resource="vlv" \
>op start interval="0" timeout="240" \
>op stop interval="0" timeout="120"
>primitive fs ocf:heartbeat:Filesystem \
>params device="/dev/drbd0" directory="/var/lib/vlv.drbd/root"
>options="noatime,nodiratime" fstype="xfs" \
>op start interval="0" timeout="300" \
>op stop interval="0" timeout="300"
>primitive postgresql lsb:postgresql \
>op monitor interval="10" timeout="60" \
>op start interval="0" timeout="60" \
>op stop interval="0" timeout="60"
>primitive bind9 lsb:bind9 \
>op monitor interval="10" timeout="60" \
>op start interval="0" timeout="60" \
>op stop interval="0" timeout="60"
>primitive nginx lsb:nginx \
>op monitor interval="10" timeout="60" \
>op start interval="0" timeout="60" \
>op stop interval="0" timeout="60"
>group server fs postgresql bind9 nginx
>ms ms_drbd drbd meta master-max="1" master-node-max="1" clone-max="2"
>clone-node-max="1" notify="true"
>location loc_server server rule $id="loc_server-rule" -inf: #uname eq
>node3
>colocation col_server inf: server ms_drbd:Master
>order ord_server inf: ms_drbd:promote server:start
>property $id="cib-bootstrap-options" \
>stonith-enabled="false" \
>last-lrm-refresh="1421079189" \
>maintenance-mode="false"

It looks like you have a symmetric (opt-out) cluster, which makes pacemaker check every host for the possibility of running each resource (even with a -inf constraint).
You want something like this: http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch06s02s02.html (or to run only corosync on that node).
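
An untested sketch of the opt-in variant in crm shell terms (constraint ids
and scores are arbitrary; with symmetric-cluster=false every resource needs
an explicit location constraint for each node it is allowed on):

    property symmetric-cluster=false
    location loc_drbd_node1 ms_drbd 100: node1
    location loc_drbd_node2 ms_drbd 100: node2
    location loc_server_node1 server 100: node1
    location loc_server_node2 server 100: node2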


--
Sent with K-9 Mail.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org