Mailing List Archive: Reset failcount for resources

Reset failcount for resources

Nov 13, 2014, 3:08 AM

Post #1 of 4 (2056 views)

Hi

I am running a 2-node cluster with this config:

Master/Slave Set: foo-master [foo]
Masters: [ bharat ]
Slaves: [ ram ]
AC_FLT (ocf::pw:IPaddr): Started bharat
CR_CP_FLT (ocf::pw:IPaddr): Started bharat
CR_UP_FLT (ocf::pw:IPaddr): Started bharat
Mgmt_FLT (ocf::pw:IPaddr): Started bharat

where the IPaddr RA is just a modified IPaddr2 RA. Additionally, I have a
colocation constraint that keeps the IP addresses colocated with the master.
I have set migration-threshold to 2 for the VIPs, and I have also set
failure-timeout to 15s.

Initially I bring down the interface on bharat to force a switch-over to ram.
After this I fail the interfaces on bharat again. Now I bring the interface
up again on ram. However, the virtual IPs are now in the stopped state.

I can't get out of this state unless I use crm_resource -C to reset the
state of the resources.
However, if I check the failcount of the resources after this, it is still
set to INFINITY.
Based on the documentation, the failcount on a node should have expired
after the failure-timeout, but that doesn't happen. Also, why isn't the
count reset by the crm_resource -C command? Is there any other command
that actually resets the failcount?

Thanks in advance

Regards
Arjun

Re: Reset failcount for resources

andrew at beekhof

Nov 16, 2014, 10:40 PM

Post #2 of 4 (1970 views)

> On 13 Nov 2014, at 10:08 pm, Arjun Pandey <apandepublic@gmail.com> wrote:
>
> Initially i bring down the interface on bharat to force switch-over to ram. After this i fail the interfaces on bharat again. Now i bring the interface up again on ram. However the virtual IP's are now in stopped state.
>
> I don't get out of this unless i use crm_resource -C to reset state of resources.
> However if i check failcount of resources after this it's still set as INFINITY.

Older versions of crm_resource didn't always reset the failcount on cleanup. I'd encourage you to upgrade your Pacemaker packages.

> Based on the documentation the failcount on a node should have expired after the failure-timeout.That doesn't happen. However why don't we reset the count after the the crm_resource -C command too. Any other command to actually reset the failcount.

There should also be a 'crm_failcount' command that will do this.
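
As a rough illustration of the crm_failcount suggestion, the Python sketch below builds the command lines for querying and deleting a failcount. The helper name is made up, and the resource and node (Mgmt_FLT on bharat) are simply taken from the status output earlier in the thread. It assumes the short options -G (query), -D (delete), -r (resource) and -N (node); option spelling has varied between Pacemaker releases, so check `crm_failcount --help` on your version first. Nothing is executed; the commands are only printed.

```python
import shlex


def failcount_command(action, resource, node):
    """Build a crm_failcount argument list (it is not executed here).

    Assumption: the installed crm_failcount accepts -G, -D, -r and -N.
    """
    flags = {"query": "-G", "delete": "-D"}
    if action not in flags:
        raise ValueError(f"unsupported action: {action}")
    return ["crm_failcount", flags[action], "-r", resource, "-N", node]


# Print the query and delete commands for one VIP on one node.
for action in ("query", "delete"):
    cmd = failcount_command(action, "Mgmt_FLT", "bharat")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list rather than a string means it could later be handed to `subprocess.run` without shell quoting problems.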

> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

Re: Reset failcount for resources

alxgomz at gmail

Nov 16, 2014, 11:14 PM

Post #3 of 4 (1975 views)

On 13 Nov 2014 12:09, "Arjun Pandey" <apandepublic@gmail.com> wrote:
>
> I don't get out of this unless i use crm_resource -C to reset state of
> resources.
> However if i check failcount of resources after this it's still set as
> INFINITY.
> Based on the documentation the failcount on a node should have expired
> after the failure-timeout. That doesn't happen.

Expiration probably does happen, in the sense that the failure is marked
as expired. However, expired failures are only removed when the recheck
timer fires, and that timer is set by cluster-recheck-interval (15 minutes
by default).
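
To make the timing concrete, here is a rough Python sketch of the behaviour described above. It is a simplified model, not Pacemaker code: it assumes the cluster is otherwise idle, that the recheck timer fires at fixed multiples of cluster-recheck-interval counted from time 0, and that an expired failure is removed at the first tick at or after its expiry time. The function name, the failure time of 100s and the 30s interval are made up for illustration; the 15s failure-timeout and the 15 minute default come from this thread.

```python
import math


def failcount_cleared_at(last_failure, failure_timeout, recheck_interval):
    """Return the time (in seconds) at which an expired failure is removed.

    Simplified model: the failure expires at
    last_failure + failure_timeout, but it is only removed at the next
    recheck tick. Ticks are assumed to fire at k * recheck_interval.
    """
    expires_at = last_failure + failure_timeout
    # Index of the first tick at or after the expiry time.
    ticks = math.ceil(expires_at / recheck_interval)
    return ticks * recheck_interval


# Failure at t=100s, failure-timeout=15s, default recheck interval (15 min).
default_case = failcount_cleared_at(100, 15, 15 * 60)

# Same failure with the recheck interval lowered to 30s (hypothetical value).
tuned_case = failcount_cleared_at(100, 15, 30)

print(default_case, tuned_case)  # 900 120
```

Under these assumptions the failure expires at t=115s in both cases, but with the default interval it lingers until the tick at t=900s, which matches the symptom of a failcount that appears never to expire.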

Re: Reset failcount for resources

apandepublic at gmail

Nov 17, 2014, 7:09 AM

Post #4 of 4 (1972 views)

Thanks Alexandre. Changing the cluster-recheck-interval worked for me :)

Regards
Arjun
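
The post does not say which command or value was used to change the interval. One plausible way, sketched here in Python, is to set the cluster option with crm_attribute. The helper name and the value "60s" are illustrative assumptions; the command is only printed, not run.

```python
import shlex


def recheck_interval_command(interval):
    """Build the crm_attribute call that sets cluster-recheck-interval.

    The interval is passed through as-is, e.g. "60s" (hypothetical value).
    """
    return [
        "crm_attribute",
        "--type", "crm_config",
        "--name", "cluster-recheck-interval",
        "--update", interval,
    ]


cmd = recheck_interval_command("60s")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

A shorter interval makes the cluster re-evaluate its state more often, so it is probably best kept no lower than needed for the failure-timeout values in use.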

On Mon, Nov 17, 2014 at 12:44 PM, Alexandre <alxgomz@gmail.com> wrote:

> Expiration probably happens, meaning the failure is marked for expiration.
> However, expired failures are only removed when the timer pops in, which is
> defined by the cluster-recheck-interval (by default 15 mins).