Mailing List Archive

Colocating with unmanaged resource
Hi,

Simple scenario: several floating IPs (FIPs) should live on "front" nodes
only where Nginx is running. There are several reasons against having
Nginx controlled by Pacemaker.

So I decided to colocate the FIPs with an unmanaged Nginx clone. This
worked fine in 1.1.6, with some exceptions.

Later, on another cluster, I decided to switch to 1.1.10 and Corosync 2
because of the performance improvements. Now I am also testing 1.1.12.

It seems I can't reliably colocate FIPs with unmanaged Nginx on 1.1.10
and 1.1.12.

Here is how the different Pacemaker versions behave:

1.1.6, 1.1.10, 1.1.12:

- if Nginx is started on a node after the initial probe for the Nginx
clone, Pacemaker never sees it running until a cleanup or some other
probe trigger

1.1.6:

- stopping nginx on a node makes the clone instance FAIL on that node,
and the FIP moves away from that node. This is as expected
- starting nginx removes the FAIL state and the FIP moves back. This is
as expected

1.1.10:

- stopping nginx on a node:
    - usually makes the clone instance FAIL on that node, but the
      FIP stays running on that node regardless of the INF colocation
    - sometimes makes the clone instance FAIL on that node, and
      immediately afterwards the clone instance returns to the STARTED
      state; the FIP stays running on that node
    - sometimes makes the clone instance STOPPED on that node, and the
      FIP moves away from that node. This is as expected
- starting nginx:
    - if it was FAIL: removes the FAIL state; the FIP remains running
    - if it was STARTED:
        - usually nothing happens: the FIP remains running
        - sometimes the clone instance goes to FAIL on that node, but the
          FIP stays running on that node regardless of the INF colocation
    - if it was STOPPED: the FIP moves back. This is as expected

1.1.12:

- stopping nginx on a node always makes the clone instance FAIL on that
node, but the FIP stays running on that node regardless of the INF
colocation
- starting nginx removes the FAIL state; the FIP remains running

Please comment on this. Also, some questions:

- are unmanaged resources designed to be used, under normal conditions,
as colocation targets for other resources? If so, how do I configure
them correctly?
- is there some kind of "recurring probe" that would "see" unmanaged
resources started after the initial probe?

Let me know if more logs are needed; right now I can't collect logs for
all cases, but some are attached.

Config for 1.1.10 (similar configs for 1.1.6 and 1.1.12):

node $id="..." pcmk10-1 \
        attributes onhv="1" front="true"
node $id="..." pcmk10-2 \
        attributes onhv="2" front="true"
node $id="..." pcmk10-3 \
        attributes onhv="3" front="true"

primitive FIP_1 ocf:heartbeat:IPaddr2 \
        op monitor interval="2s" \
        params ip="10.1.1.1" cidr_netmask="16" \
        meta migration-threshold="2" failure-timeout="60s"
primitive FIP_2 ocf:heartbeat:IPaddr2 \
        op monitor interval="2s" \
        params ip="10.1.2.1" cidr_netmask="16" \
        meta migration-threshold="2" failure-timeout="60s"
primitive FIP_3 ocf:heartbeat:IPaddr2 \
        op monitor interval="2s" \
        params ip="10.1.3.1" cidr_netmask="16" \
        meta migration-threshold="2" failure-timeout="60s"

primitive Nginx lsb:nginx \
        op start interval="0" enabled="false" \
        op stop interval="0" enabled="false" \
        op monitor interval="2s"

clone cl_Nginx Nginx \
        meta globally-unique="false" notify="false" is-managed="false"

location loc-cl_Nginx cl_Nginx \
        rule $id="loc-cl_Nginx-r1" 500: front eq true

location loc-FIP_1 FIP_1 \
        rule $id="loc-FIP_1-r1" 500: onhv eq 1 and front eq true \
        rule $id="loc-FIP_1-r2" 200: defined onhv and onhv ne 1 and front eq true
location loc-FIP_2 FIP_2 \
        rule $id="loc-FIP_2-r1" 500: onhv eq 2 and front eq true \
        rule $id="loc-FIP_2-r2" 200: defined onhv and onhv ne 2 and front eq true
location loc-FIP_3 FIP_3 \
        rule $id="loc-FIP_3-r1" 500: onhv eq 3 and front eq true \
        rule $id="loc-FIP_3-r2" 200: defined onhv and onhv ne 3 and front eq true

colocation coloc-FIP_1-cl_Nginx inf: FIP_1 cl_Nginx
colocation coloc-FIP_2-cl_Nginx inf: FIP_2 cl_Nginx
colocation coloc-FIP_3-cl_Nginx inf: FIP_3 cl_Nginx

property $id="cib-bootstrap-options" \
        dc-version="1.1.10-42f2063" \
        cluster-infrastructure="corosync" \
        symmetric-cluster="false" \
        stonith-enabled="false" \
        no-quorum-policy="stop" \
        cluster-recheck-interval="10s" \
        maintenance-mode="false" \
        last-lrm-refresh="1418998945"
rsc_defaults $id="rsc-options" \
        resource-stickiness="30"
op_defaults $id="op_defaults-options" \
        record-pending="false"
Re: Colocating with unmanaged resource
> On 20 Dec 2014, at 6:21 am, Покотиленко Костик <casper@meteor.dp.ua> wrote:
>
> 1.1.6, 1.1.10, 1.1.12:
>
> - if Nginx has started on a node after initial probe for Nginx clone
> then pacemaker will never see it running until cleanup or other probe
> trigger

you'll want a recurring monitor with role=Stopped
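
For reference, in crmsh such a Stopped-role monitor might be sketched
roughly like this (an illustrative fragment based on the lsb:nginx
primitive from the original post; the Stopped monitor's interval must
differ from the regular monitor's interval):

```
primitive Nginx lsb:nginx \
        op monitor interval="2s" \
        op monitor interval="3s" role="Stopped"
```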

> 1.1.12:
>
> - stopping nginx on a node always makes the clone instance to FAIL for
> that node, but FIP stays running on that node regardless of INF
> colocation

can you attach a crm_report of the above test please?


> - starting nginx removes FAIL state, FIP remains running


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: Colocating with unmanaged resource
> On 15 Jan 2015, at 12:54 am, Покотиленко Костик <casper@meteor.dp.ua> wrote:
>
> On Tue, 06/01/2015 at 16:27 +1100, Andrew Beekhof wrote:
>>> On 20 Dec 2014, at 6:21 am, Покотиленко Костик <casper@meteor.dp.ua> wrote:
>>> Here are behaviors of different versions of pacemaker:
>>>
>>> 1.1.12:
>>>
>>> - stopping nginx on a node always makes the clone instance to FAIL for
>>> that node, but FIP stays running on that node regardless of INF
>>> colocation
>>
>> can you attach a crm_report of the above test please?
>
> crm_report of this test attached as
> pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2

is there a reason nginx is not managed?
if it were managed, we'd have stopped it and FIP_2 would have been moved

>
>> 1.1.6, 1.1.10, 1.1.12:
>>
>>> - if Nginx has started on a node after initial probe for Nginx clone
>>> then pacemaker will never see it running until cleanup or other
> probe
>>> trigger
>>
>> you'll want a recurring monitor with role=Stopped
>>
>
> How is it done?

I don't know the crmsh syntax. Sorry

>
> I've tried on 1.1.12 with:
> primitive Nginx lsb:nginx \
> op monitor interval=2s \
> op monitor interval=3s role=Stopped
>
> This produces a warning that monitor_Stopped may be unsupported by the RA.

I'm not familiar with that warning.
Where did you see it?

> Should it?
> And it's not recognizing start of nginx.

It seems role=Stopped only works for primitives (not clones)
I've made a note to get this fixed

>
> Steps:
> - stop nginx on 2nd node
> - cleanup cl_Nginx so that pacemaker forgets nginx was running on the 2nd
> node
> - clear logs
> - start nginx
> - nothing happens
> - make crm_report
>
> crm_report of this test attached as
> pcmk-monitor-stopped-Wed-14-Jan-2015.tar.bz2
>
> <pcmk-monitor-stopped-Wed-14-Jan-2015.tar.bz2><pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2>


Re: Colocating with unmanaged resource
Andrew Beekhof <andrew@beekhof.net> writes:

>>> you'll want a recurring monitor with role=Stopped
>>>
>>
>> How is it done?
>
> I don't know the crmsh syntax. Sorry
>
>>
>> I've tried on 1.1.12 with:
>> primitive Nginx lsb:nginx \
>> op monitor interval=2s \
>> op monitor interval=3s role=Stopped
>>
>> This produces warning that monitor_stopped may be unsupported by RA.
>

To clarify, the above is indeed the correct crmsh syntax for a monitor
op with role=Stopped.


--
// Kristoffer Grönlund
// kgronlund@suse.com

Re: Colocating with unmanaged resource
On Thu, 22/01/2015 at 14:59 +1100, Andrew Beekhof wrote:
> > On 15 Jan 2015, at 12:54 am, Покотиленко Костик <casper@meteor.dp.ua> wrote:
> >
> > On Tue, 06/01/2015 at 16:27 +1100, Andrew Beekhof wrote:
> >>> On 20 Dec 2014, at 6:21 am, Покотиленко Костик <casper@meteor.dp.ua> wrote:
> >>> Here are behaviors of different versions of pacemaker:
> >>>
> >>> 1.1.12:
> >>>
> >>> - stopping nginx on a node always makes the clone instance to FAIL for
> >>> that node, but FIP stays running on that node regardless of INF
> >>> colocation
> >>
> >> can you attach a crm_report of the above test please?
> >
> > crm_report of this test attached as
> > pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2
>
> is there a reason nginx is not managed?
> if it wasn't, then we'd have stopped it and FIP_2 would have been moved

I'm not sure I got this right.

Nginx is unmanaged intentionally (is-managed="false"); hence the subject
line. And the whole point is that stopping an unmanaged nginx does not
move away the FIP that is INF-colocated with it (this is regarding
1.1.12; 1.1.6 works fine).

> >> 1.1.6, 1.1.10, 1.1.12:
> >>
> >>> - if Nginx has started on a node after initial probe for Nginx clone
> >>> then pacemaker will never see it running until cleanup or other
> > probe
> >>> trigger
> >>
> >> you'll want a recurring monitor with role=Stopped
> >>
> >
> > How is it done?
>
> I don't know the crmsh syntax. Sorry
>
> >
> > I've tried on 1.1.12 with:
> > primitive Nginx lsb:nginx \
> > op monitor interval=2s \
> > op monitor interval=3s role=Stopped
> >
> > This produces warning that monitor_stopped may be unsupported by RA.
>
> I'm not familiar with that warning.
> Where did you see it?

The exact text is:
WARNING: Nginx: action monitor_Stopped not advertised in meta-data, it may not be supported by the RA

This is produced by crm configure edit.

> > Should it?
> > And it's not recognizing start of nginx.
>
> It seems role=Stopped only works for primitives (not clones)
> I've made a note to get this fixed

This will make unmanaged resources more usable, thanks.

> >
> > Steps:
> > - stop nginx on 2nd node
> > - cleanup cl_Nginx so that pacemaker forget nginx was running in 2nd
> > node
> > - clear logs
> > - start nginx
> > - nothing happens
> > - make crm_report
> >
> > crm_report of this test attached as
> > pcmk-monitor-stopped-Wed-14-Jan-2015.tar.bz2
> >
> > <pcmk-monitor-stopped-Wed-14-Jan-2015.tar.bz2><pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2>
>






Re: Colocating with unmanaged resource
> On 28 Feb 2015, at 6:00 am, Покотиленко Костик <casper@meteor.dp.ua> wrote:
>
> On Thu, 22/01/2015 at 14:59 +1100, Andrew Beekhof wrote:
>>> On 15 Jan 2015, at 12:54 am, Покотиленко Костик <casper@meteor.dp.ua> wrote:
>>>
>>> On Tue, 06/01/2015 at 16:27 +1100, Andrew Beekhof wrote:
>>>>> On 20 Dec 2014, at 6:21 am, Покотиленко Костик <casper@meteor.dp.ua> wrote:
>>>>> Here are behaviors of different versions of pacemaker:
>>>>>
>>>>> 1.1.12:
>>>>>
>>>>> - stopping nginx on a node always makes the clone instance to FAIL for
>>>>> that node, but FIP stays running on that node regardless of INF
>>>>> colocation
>>>>
>>>> can you attach a crm_report of the above test please?
>>>
>>> crm_report of this test attached as
>>> pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2
>>
>> is there a reason nginx is not managed?
>> if it wasn't, then we'd have stopped it and FIP_2 would have been moved
>
> I'm not sure I got this right.
>
> Nginx is not managed by intention (is-managed="false") that's why subj.
> And the whole subject is in fact that stopping unmanaged nginx doesn't
> move away FIP which is INF colocated with it (this is regarding 1.1.12,
> 1.1.6 works fine).

Ahhhh.
We changed the way monitors that return OCF_NOT_RUNNING were handled to still require a stop under most conditions.
I've added "not managed" to the list of exceptions:

diff --git a/lib/pengine/unpack.c b/lib/pengine/unpack.c
index 308258d..6dc44fd 100644
--- a/lib/pengine/unpack.c
+++ b/lib/pengine/unpack.c
@@ -2689,7 +2689,7 @@ determine_op_status(
             break;

         case PCMK_OCF_NOT_RUNNING:
-            if (is_probe || target_rc == rc) {
+            if (is_probe || target_rc == rc || is_not_set(rsc->flags, pe_rsc_managed)) {
                 result = PCMK_LRM_OP_DONE;
                 rsc->role = RSC_ROLE_STOPPED;

Look for this in 1.1.13-rc2

>
>>>> 1.1.6, 1.1.10, 1.1.12:
>>>>
>>>>> - if Nginx has started on a node after initial probe for Nginx clone
>>>>> then pacemaker will never see it running until cleanup or other
>>> probe
>>>>> trigger
>>>>
>>>> you'll want a recurring monitor with role=Stopped
>>>>
>>>
>>> How is it done?
>>
>> I don't know the crmsh syntax. Sorry
>>
>>>
>>> I've tried on 1.1.12 with:
>>> primitive Nginx lsb:nginx \
>>> op monitor interval=2s \
>>> op monitor interval=3s role=Stopped
>>>
>>> This produces warning that monitor_stopped may be unsupported by RA.
>>
>> I'm not familiar with that warning.
>> Where did you see it?
>
> The exact text is:
> WARNING: Nginx: action monitor_Stopped not advertised in meta-data, it may not be supported by the RA
>
> This is produced by crm configure edit,

Hmmm, you'd have to take that up with the crmsh maintainers.

>
>>> Should it?
>>> And it's not recognizing start of nginx.
>>
>> It seems role=Stopped only works for primitives (not clones)
>> I've made a note to get this fixed
>
> This will add usability for unmanaged resources, thanks.
>
>>>
>>> Steps:
>>> - stop nginx on 2nd node
>>> - cleanup cl_Nginx so that pacemaker forget nginx was running in 2nd
>>> node
>>> - clear logs
>>> - start nginx
>>> - nothing happens
>>> - make crm_report
>>>
>>> crm_report of this test attached as
>>> pcmk-monitor-stopped-Wed-14-Jan-2015.tar.bz2
>>>
>>> <pcmk-monitor-stopped-Wed-14-Jan-2015.tar.bz2><pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2>

