Mailing List Archive

pacemaker/corosync: a resource is started on 2 nodes
Hi!

I have a small corosync/pacemaker-based cluster which consists of 4 nodes. 2 nodes are in standby mode; the other 2 actually handle all the resources.

Corosync ver. 1.4.7-1.
Pacemaker ver. 1.1.11.
OS: Ubuntu 12.04.

In our production environment, which has plenty of free RAM, CPU etc., everything works well. When I switch one node off, all the resources move to the other without any problems, and vice versa. That's what I need :)

Our staging environment has rather weak hardware (that's OK - it's just staging :) ) and is rather busy. Sometimes it doesn't even have enough CPU or disk speed to be stable. When that happens some of the cluster resources fail (which I consider normal), but I also see the following crm output:

Node db-node1: standby
Node db-node2: standby
Online: [ lb-node1 lb-node2 ]

Pgpool2 (ocf::heartbeat:pgpool): FAILED (unmanaged) [ lb-node2 lb-node1 ]
Resource Group: IPGroup
FailoverIP1 (ocf::heartbeat:IPaddr2): Started [ lb-node2 lb-node1 ]

As you can see, the resource ocf::heartbeat:IPaddr2 is started on both nodes ( lb-node2 and lb-node1 ), but I can't figure out how that could happen.

this is the output of my crm configure show:

node db-node1 \
attributes standby=on
node db-node2 \
attributes standby=on
node lb-node1
node lb-node2
primitive Cachier ocf:site:cachier \
op monitor interval=10s timeout=30s depth=10 \
meta target-role=Started
primitive FailoverIP1 IPaddr2 \
params ip=111.22.33.44 cidr_netmask=32 iflabel=FAILOVER \
op monitor interval=30s
primitive Mailer ocf:site:mailer \
meta target-role=Started \
op monitor interval=10s timeout=30s depth=10
primitive Memcached memcached \
op monitor interval=10s timeout=30s depth=10 \
meta target-role=Started
primitive Nginx nginx \
params status10url="/nginx_status" testclient=curl port=8091 \
op monitor interval=10s timeout=30s depth=10 \
op start interval=0 timeout=40s \
op stop interval=0 timeout=60s \
meta target-role=Started
primitive Pgpool2 pgpool \
params checkmethod=pid \
op monitor interval=30s \
op start interval=0 timeout=40s \
op stop interval=0 timeout=60s
group IPGroup FailoverIP1 \
meta target-role=Started
colocation ip-with-cachier inf: Cachier IPGroup
colocation ip-with-mailer inf: Mailer IPGroup
colocation ip-with-memcached inf: Memcached IPGroup
colocation ip-with-nginx inf: Nginx IPGroup
colocation ip-with-pgpool inf: Pgpool2 IPGroup
order cachier-after-ip inf: IPGroup Cachier
order mailer-after-ip inf: IPGroup Mailer
order memcached-after-ip inf: IPGroup Memcached
order nginx-after-ip inf: IPGroup Nginx
order pgpool-after-ip inf: IPGroup Pgpool2
property cib-bootstrap-options: \
expected-quorum-votes=4 \
stonith-enabled=false \
default-resource-stickiness=100 \
maintenance-mode=false \
dc-version=1.1.10-9d39a6b \
cluster-infrastructure="classic openais (with plugin)" \
last-lrm-refresh=1422438144


So the question is: does my config allow a resource like ocf::heartbeat:IPaddr2 to be started on multiple nodes simultaneously? Is this something that can normally happen? Or is it happening because of the shortage of computing power I described earlier? :)
How can I prevent something like this from happening? Is this the kind of case that is normally supposed to be solved by STONITH?

Thanks in advance.

--
Best regards,
Sergey Arlashin








_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: pacemaker/corosync: a resource is started on 2 nodes [ In reply to ]
On Wednesday, 28 January 2015, at 14:20:51, Sergey Arlashin wrote:
> Hi!
>
> I have a small corosync/pacemaker-based cluster which consists of 4 nodes. 2
> nodes are in standby mode; the other 2 actually handle all the resources.
>
> Corosync ver. 1.4.7-1.
> Pacemaker ver. 1.1.11.
> OS: Ubuntu 12.04.
>
> In our production environment, which has plenty of free RAM, CPU etc.,
> everything works well. When I switch one node off, all the resources
> move to the other without any problems, and vice versa. That's what I need :)
>
> Our staging environment has rather weak hardware (that's OK - it's just
> staging :) ) and is rather busy. Sometimes it doesn't even have enough CPU
> or disk speed to be stable. When that happens some of the cluster resources
> fail (which I consider normal), but I also see the following crm
> output:
>
> Node db-node1: standby
> Node db-node2: standby
> Online: [ lb-node1 lb-node2 ]
>
> Pgpool2 (ocf::heartbeat:pgpool): FAILED (unmanaged) [ lb-node2 lb-node1 ]
> Resource Group: IPGroup
> FailoverIP1 (ocf::heartbeat:IPaddr2): Started [ lb-node2 lb-node1 ]
>
> As you can see, the resource ocf::heartbeat:IPaddr2 is started on both nodes
> ( lb-node2 and lb-node1 ), but I can't figure out how that could happen.

Your config does not allow this, but since your hardware is slow, pacemaker runs into
timeouts and corosync connection problems. You could debug the problem by
tracing the event in the logs. With the command crm_mon -1rtf you can find the time
of the failure; search around that time in the logs.
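
For example (just a rough sketch; the log file location is an assumption - on Ubuntu
12.04 corosync and pacemaker usually log to /var/log/syslog unless logging was
redirected in corosync.conf):

  # show the current status, including fail counts and the timing of recent operations
  crm_mon -1rtf

  # then look at the cluster daemons' messages around the failure time,
  # on both lb-node1 and lb-node2
  grep -E 'corosync|crmd|pengine|lrmd|stonith' /var/log/syslog | less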

If the communication in the cluster does not work, pacemaker sometimes behaves
very oddly.

Kind regards,

Michael Schwartzkopff

--
sys4 AG

http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
Franziskanerstraße 15, 81669 München

Registered office: München, Amtsgericht München: HRB 199263
Executive board (Vorstand): Patrick Ben Koetter, Marc Schiffbauer
Chairman of the supervisory board: Florian Kirstein

Re: pacemaker/corosync: a resource is started on 2 nodes [ In reply to ]
> On 28 Jan 2015, at 9:20 pm, Sergey Arlashin <sergeyarl.maillist@gmail.com> wrote:
>
> Hi!
>
> I have a small corosync/pacemaker-based cluster which consists of 4 nodes. 2 nodes are in standby mode; the other 2 actually handle all the resources.
>
> Corosync ver. 1.4.7-1.
> Pacemaker ver. 1.1.11.
> OS: Ubuntu 12.04.
>
> In our production environment, which has plenty of free RAM, CPU etc., everything works well. When I switch one node off, all the resources move to the other without any problems, and vice versa. That's what I need :)
>
> Our staging environment has rather weak hardware (that's OK - it's just staging :) ) and is rather busy. Sometimes it doesn't even have enough CPU or disk speed to be stable. When that happens some of the cluster resources fail (which I consider normal), but I also see the following crm output:
>
> Node db-node1: standby
> Node db-node2: standby
> Online: [ lb-node1 lb-node2 ]
>
> Pgpool2 (ocf::heartbeat:pgpool): FAILED (unmanaged) [ lb-node2 lb-node1 ]
> Resource Group: IPGroup
> FailoverIP1 (ocf::heartbeat:IPaddr2): Started [ lb-node2 lb-node1 ]
>
> As you can see, the resource ocf::heartbeat:IPaddr2 is started on both nodes ( lb-node2 and lb-node1 ), but I can't figure out how that could happen.

stonith-enabled=false is one especially good way, particularly in an unstable environment.
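
For illustration, a minimal sketch of what turning fencing on could look like; external/ipmi and all of the parameter values below are placeholders - use whatever fencing device (IPMI board, managed PDU, hypervisor, ...) your nodes actually have:

  # one fencing device per node to be fenced (agent name and credentials are examples only)
  crm configure primitive fence-lb-node1 stonith:external/ipmi \
      params hostname=lb-node1 ipaddr=10.0.0.101 userid=admin passwd=secret \
      op monitor interval=60s
  # don't let a node run the device that is meant to fence it
  crm configure location fence-lb-node1-placement fence-lb-node1 -inf: lb-node1
  # once a device exists for every node, fencing can be switched on
  crm configure property stonith-enabled=true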

It could even be that it is showing up as running due to failed monitor operations and is not actually running there (but for safety we have to assume it is).
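
For example, a quick way to see whether the address is really configured in both places is to run this on each lb-node (the grep pattern is just the ip= value from the config quoted below):

  ip -o addr show | grep 111.22.33.44

and, once the underlying problem is sorted out, the failure can be cleared with something like:

  crm resource cleanup Pgpool2
  crm resource cleanup FailoverIP1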

>
> this is the output of my crm configure show:
>
> node db-node1 \
> attributes standby=on
> node db-node2 \
> attributes standby=on
> node lb-node1
> node lb-node2
> primitive Cachier ocf:site:cachier \
> op monitor interval=10s timeout=30s depth=10 \
> meta target-role=Started
> primitive FailoverIP1 IPaddr2 \
> params ip=111.22.33.44 cidr_netmask=32 iflabel=FAILOVER \
> op monitor interval=30s
> primitive Mailer ocf:site:mailer \
> meta target-role=Started \
> op monitor interval=10s timeout=30s depth=10
> primitive Memcached memcached \
> op monitor interval=10s timeout=30s depth=10 \
> meta target-role=Started
> primitive Nginx nginx \
> params status10url="/nginx_status" testclient=curl port=8091 \
> op monitor interval=10s timeout=30s depth=10 \
> op start interval=0 timeout=40s \
> op stop interval=0 timeout=60s \
> meta target-role=Started
> primitive Pgpool2 pgpool \
> params checkmethod=pid \
> op monitor interval=30s \
> op start interval=0 timeout=40s \
> op stop interval=0 timeout=60s
> group IPGroup FailoverIP1 \
> meta target-role=Started
> colocation ip-with-cachier inf: Cachier IPGroup
> colocation ip-with-mailer inf: Mailer IPGroup
> colocation ip-with-memcached inf: Memcached IPGroup
> colocation ip-with-nginx inf: Nginx IPGroup
> colocation ip-with-pgpool inf: Pgpool2 IPGroup
> order cachier-after-ip inf: IPGroup Cachier
> order mailer-after-ip inf: IPGroup Mailer
> order memcached-after-ip inf: IPGroup Memcached
> order nginx-after-ip inf: IPGroup Nginx
> order pgpool-after-ip inf: IPGroup Pgpool2
> property cib-bootstrap-options: \
> expected-quorum-votes=4 \
> stonith-enabled=false \
> default-resource-stickiness=100 \
> maintenance-mode=false \
> dc-version=1.1.10-9d39a6b \
> cluster-infrastructure="classic openais (with plugin)" \
> last-lrm-refresh=1422438144
>
>
> So the question is: does my config allow a resource like ocf::heartbeat:IPaddr2 to be started on multiple nodes simultaneously? Is this something that can normally happen? Or is it happening because of the shortage of computing power I described earlier? :)
> How can I prevent something like this from happening? Is this the kind of case that is normally supposed to be solved by STONITH?
>
> Thanks in advance.
>
> --
> Best regards,
> Sergey Arlashin


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org