Mailing List Archive

Long failover
Hello,

We have a cluster configured via pacemaker+corosync+crm. The configuration is:

node master
node slave
primitive HA-VIP1 IPaddr2 \
params ip=192.168.22.71 nic=bond0 \
op monitor interval=1s
primitive HA-variator lsb:variator \
op monitor interval=1s \
meta migration-threshold=1 failure-timeout=1s
group HA-Group HA-VIP1 HA-variator
property cib-bootstrap-options: \
dc-version=1.1.10-14.el6-368c726 \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes=2 \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1383871087
rsc_defaults rsc-options: \
resource-stickiness=100

First I take the variator service down on the master node (actually I delete the service binary and kill the variator process, so the variator fails to restart). Resources very quickly move to the slave node as expected. Then I restore the binary on the master and restart the variator service. Now I do the same with the binary and service on the slave node. The crm status command quickly shows HA-variator (lsb:variator): Stopped. But it takes too long (for us) before resources are switched back to the master node (around 1 min). Then the line
Failed actions:
HA-variator_monitor_1000 on slave 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Sat Dec 21 03:59:45 2013', queued=0ms, exec=0ms
appears in the crm status output and the resources are switched.

What is that timeout? Where can I change it?

------------------------
Kind regards,
Dmitriy Matveichev.
Re: Long failover
On Fri, Nov 14, 2014 at 2:57 PM, Dmitry Matveichev
<d.matveichev@mfisoft.ru> wrote:
> [quoted configuration and test description trimmed]
>
> What is that timeout? Where can I change it?
>

This is the operation timeout. You can change it in the operation definition:
op monitor interval=1s timeout=5s
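
For example, the full primitive might then look like this (a sketch only; the 5s value is illustrative and should be sized to how long a status check of the variator can legitimately take):

primitive HA-variator lsb:variator \
        op monitor interval=1s timeout=5s \
        meta migration-threshold=1 failure-timeout=1s

Alternatively, a cluster-wide default can be set for all operations:

op_defaults timeout=5s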

Re: Long failover
We've already tried to set it but it didn't help.

------------------------
Kind regards,
Dmitriy Matveichev.


-----Original Message-----
From: Andrei Borzenkov [mailto:arvidjaar@gmail.com]
Sent: Friday, November 14, 2014 4:12 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Long failover

On Fri, Nov 14, 2014 at 2:57 PM, Dmitry Matveichev <d.matveichev@mfisoft.ru> wrote:
> [quoted configuration and test description trimmed]
>
> What is that timeout? Where can I change it?
>

This is the operation timeout. You can change it in the operation definition:
op monitor interval=1s timeout=5s

Re: Long failover
On Fri, Nov 14, 2014 at 4:33 PM, Dmitry Matveichev
<d.matveichev@mfisoft.ru> wrote:
> We've already tried to set it but it didn't help.
>

I doubt it is possible to say anything without logs.
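
If crm_report is available on your nodes, something along these lines should collect logs and cluster state from all nodes for a window around one failover (the times and destination below are placeholders):

crm_report -f "2014-11-14 13:00" -t "2014-11-14 14:00" /tmp/failover-report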

Re: Long failover
Please find attached.

------------------------
Kind regards,
Dmitriy Matveichev.


-----Original Message-----
From: Andrei Borzenkov [mailto:arvidjaar@gmail.com]
Sent: Friday, November 14, 2014 4:44 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Long failover

On Fri, Nov 14, 2014 at 4:33 PM, Dmitry Matveichev <d.matveichev@mfisoft.ru> wrote:
> We've already tried to set it but it didn't help.
>

I doubt it is possible to say anything without logs.

Re: Long failover
> On 14 Nov 2014, at 10:57 pm, Dmitry Matveichev <d.matveichev@mfisoft.ru> wrote:
>
> Hello,
>
> We have a cluster configured via pacemaker+corosync+crm. The configuration is:
>
> node master
> node slave
> primitive HA-VIP1 IPaddr2 \
> params ip=192.168.22.71 nic=bond0 \
> op monitor interval=1s
> primitive HA-variator lsb:variator \
> op monitor interval=1s \
> meta migration-threshold=1 failure-timeout=1s
> group HA-Group HA-VIP1 HA-variator
> property cib-bootstrap-options: \
> dc-version=1.1.10-14.el6-368c726 \
> cluster-infrastructure="classic openais (with plugin)" \

General advice: don't use the plugin. See:

http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/
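
On RHEL 6.4 and later this means running pacemaker under CMAN rather than the custom plugin. Very roughly (a sketch, not a complete migration; the cluster name is a placeholder and fencing still has to be configured properly):

ccs -f /etc/cluster/cluster.conf --createcluster mycluster
ccs -f /etc/cluster/cluster.conf --addnode master
ccs -f /etc/cluster/cluster.conf --addnode slave
service cman start
service pacemaker start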

> expected-quorum-votes=2 \
> stonith-enabled=false \
> no-quorum-policy=ignore \
> last-lrm-refresh=1383871087
> rsc_defaults rsc-options: \
> resource-stickiness=100
>
> First I take the variator service down on the master node (actually I delete the service binary and kill the variator process, so the variator fails to restart). Resources very quickly move to the slave node as expected. Then I restore the binary on the master and restart the variator service. Now I do the same with the binary and service on the slave node. The crm status command quickly shows HA-variator (lsb:variator): Stopped. But it takes too long (for us) before resources are switched back to the master node (around 1 min).

I see what you mean:

2013-12-21T07:04:12.230827+04:00 master crmd[14267]: notice: te_rsc_command: Initiating action 2: monitor HA-variator_monitor_1000 on slave.mfisoft.ru
2013-12-21T05:45:09+04:00 slave crmd[7086]: notice: process_lrm_event: slave.mfisoft.ru-HA-variator_monitor_1000:106 [ variator.x is stopped\n ]

(1 minute goes by)

2013-12-21T07:05:14.232029+04:00 master crmd[14267]: error: print_synapse: [Action 2]: In-flight rsc op HA-variator_monitor_1000 on slave.mfisoft.ru (priority: 0, waiting: none)
2013-12-21T07:05:14.232102+04:00 master crmd[14267]: warning: cib_action_update: rsc_op 2: HA-variator_monitor_1000 on slave.mfisoft.ru timed out

Is there a corosync log file configured? That would have more detail on slave.
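
If there isn't one, something like this in corosync.conf should enable it (path and options are only suggestions):

logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        timestamp: on
}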

> Then line
> Failed actions:
> HA-variator_monitor_1000 on slave 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Sat Dec 21 03:59:45 2013', queued=0ms, exec=0ms
> appears in the crm status output and the resources are switched.
>
> What is that timeout? Where can I change it?
>
> ------------------------
> Kind regards,
> Dmitriy Matveichev.
Re: Long failover
On Mon, Nov 17, 2014 at 9:34 AM, Andrew Beekhof <andrew@beekhof.net> wrote:
>
>> [quoted configuration and test description trimmed]
>
> I see what you mean:
>
> 2013-12-21T07:04:12.230827+04:00 master crmd[14267]: notice: te_rsc_command: Initiating action 2: monitor HA-variator_monitor_1000 on slave.mfisoft.ru
> 2013-12-21T05:45:09+04:00 slave crmd[7086]: notice: process_lrm_event: slave.mfisoft.ru-HA-variator_monitor_1000:106 [ variator.x is stopped\n ]
>
> (1 minute goes by)
>
> 2013-12-21T07:05:14.232029+04:00 master crmd[14267]: error: print_synapse: [Action 2]: In-flight rsc op HA-variator_monitor_1000 on slave.mfisoft.ru (priority: 0, waiting: none)
> 2013-12-21T07:05:14.232102+04:00 master crmd[14267]: warning: cib_action_update: rsc_op 2: HA-variator_monitor_1000 on slave.mfisoft.ru timed out
>

Is it possible that pacemaker is confused by time difference on master
and slave?

> Is there a corosync log file configured? That would have more detail on slave.
Re: Long failover
> On 17 Nov 2014, at 6:17 pm, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>
> On Mon, Nov 17, 2014 at 9:34 AM, Andrew Beekhof <andrew@beekhof.net> wrote:
>> [quoted thread trimmed]
>
> Is it possible that pacemaker is confused by time difference on master
> and slave?

Timeouts are all calculated locally, so it shouldn't be an issue (aside from making the logs harder to read).
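
Even so, keeping the clocks in sync makes the logs much easier to correlate. A rough sketch for RHEL 6 style systems (the NTP server is a placeholder):

ntpdate pool.ntp.org    # one-off correction
service ntpd start      # keep the clocks synced from now on
chkconfig ntpd on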

Re: Long failover
Hello,

Debug logs from slave are attached. Hope it helps.

------------------------
Kind regards,
Dmitriy Matveichev.

-----Original Message-----
From: Andrew Beekhof [mailto:andrew@beekhof.net]
Sent: Monday, November 17, 2014 10:48 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Long failover


>> Is there a corosync log file configured? That would have more detail on slave.

[remainder of quoted thread trimmed]
Re: Long failover
Hello,
Any thoughts about this issue? It still affects our cluster.

------------------------
Kind regards,
Dmitriy Matveichev.


-----Original Message-----
From: Dmitry Matveichev
Sent: Monday, November 17, 2014 12:32 PM
To: The Pacemaker cluster resource manager
Subject: RE: [Pacemaker] Long failover

Hello,

Debug logs from slave are attached. Hope it helps.

[remainder of quoted thread trimmed]
Re: Long failover
I need to see logs from both nodes that relate to the same instance of the issue.

Why are the dates so crazy?
One is from a year ago and the other is in the (at the time) future.
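
A quick way to check is to run the same command on both nodes at roughly the same moment and compare the output (ntpq only works if ntpd is running):

date -u; ntpq -p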


> On 2 Dec 2014, at 7:04 pm, Dmitry Matveichev <d.matveichev@mfisoft.ru> wrote:
>
> Hello,
> Any thoughts about this issue? It still affects our cluster.
>
> [remainder of quoted thread trimmed]


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org