large cluster - failure recovery
Hi,

I have a cluster of 32 nodes, and after some tuning I was able to get it
started and running, but it does not recover from a node
disconnect-reconnect failure. It regains quorum, but the CIB does not
return to a synchronized state and "cibadmin -Q" times out.

Is there anything I can do with corosync or pacemaker parameters to make it
recover from such a situation? (Everything works for smaller clusters.)
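
As a diagnostic sketch (assuming cibadmin's -l/--local and -t/--timeout
options are available in this pacemaker version), the local copy of the CIB
can still be inspected, and its version attributes compared across nodes,
even when the cluster-wide query hangs:

# Query only this node's copy of the CIB; the admin_epoch/epoch/num_updates
# attributes on the <cib> element can be compared across nodes to find the
# out-of-sync ones.
cibadmin -Q -l | head -n 1

# Retry the cluster-wide query with a longer timeout, in case it is only slow:
cibadmin -Q -t 300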

In my case it is OK for a node to disconnect (all the major resources are
shut down) and later rejoin the cluster (the running monitoring agent will
clean up and restart major resources if needed),
so I do not have STONITH configured.
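
For completeness, a minimal sketch of how fencing is typically left disabled
via the standard cluster property (crm_attribute is only one way to set it;
pcs or crmsh would work equally well):

# Disable fencing cluster-wide; only reasonable here because a disconnected
# node's major resources are known to be shut down, as described above.
crm_attribute --type crm_config --name stonith-enabled --update false
# Verify the current value:
crm_attribute --type crm_config --name stonith-enabled --query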

Details:
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'


Corosync configuration:
token: 10000
#token_retransmits_before_loss_const: 10
consensus: 15000
join: 1000
send_join: 80
merge: 1000
downcheck: 2000
#rrp_problem_count_timeout: 5000
max_network_delay: 150 # for azure
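
These timers all belong to the totem section of /etc/corosync/corosync.conf;
as a sketch, in context they would look like this (the surrounding structure
is the standard corosync layout, only the values come from the list above):

totem {
    version: 2
    token: 10000
    #token_retransmits_before_loss_const: 10
    consensus: 15000
    join: 1000
    send_join: 80
    merge: 1000
    downcheck: 2000
    #rrp_problem_count_timeout: 5000
    max_network_delay: 150  # for azure
}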


Some logs:

[...]
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
[...]

[...]
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [MAIN ]
Completed service synchronization, ready to provide service.
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
[...]

[...]
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: info:
apply_xml_diff: Digest mis-match: expected
01192e5118739b7c33c23f7645da3f45, calculated
f8028c0c98526179ea5df0a2ba0d09de
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: warning:
cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.2: Failed application of an update diff
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.3: current "num_updates" is greater than required
[...]


P.S. Sorry if this should have been posted to the corosync list; it is the CIB
synchronization that fails, so this group seemed to me the right place.

--
Best Regards,

Radoslaw Garbacz
Re: large cluster - failure recovery
On 04/11/15 18:41, Radoslaw Garbacz wrote:
> Details:
> OS: CentOS 6
> Pacemaker: Pacemaker 1.1.9-1512.el6
> Corosync: Corosync Cluster Engine, version '2.3.2'

yum update

Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
major improvements in speed with later versions of pacemaker.
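
For example (package names assumed to match the CentOS 6 base repository):

yum update pacemaker corosync
rpm -q pacemaker corosync   # confirm the installed versions afterwards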

Trevor

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: large cluster - failure recovery
Thank you, will give it a try.

On Wed, Nov 4, 2015 at 12:50 PM, Trevor Hemsley <themsley@voiceflex.com>
wrote:

> On 04/11/15 18:41, Radoslaw Garbacz wrote:
> > Details:
> > OS: CentOS 6
> > Pacemaker: Pacemaker 1.1.9-1512.el6
> > Corosync: Corosync Cluster Engine, version '2.3.2'
>
> yum update
>
> Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
> major improvements in speed with later versions of pacemaker.
>
> Trevor
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
Re: large cluster - failure recovery
Hello,

We've also set up a fairly large cluster - 24 nodes / 348 resources (pacemaker 1.1.12, corosync 1.4.7) - and pacemaker 1.1.12 is definitely the minimum version you'll want, thanks to changes in how the CIB is handled.

If you're also going to handle a large number (several hundred) of resources, you may need to concern yourself with the CIB size as well.
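
A rough way to gauge that size is simply to measure the serialized CIB on one node, for example:

# Size in bytes of the full CIB as returned by a query:
cibadmin -Q | wc -c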

You may want to have a look at pp. 17-18 of the document I wrote to describe our setup: http://cedric.dufour.name/cv/download/idiap_havc2.pdf

Currently, I would consider that with 24 nodes / 348 resources we are close to the limit of what our cluster can handle, the bottleneck being CPU (core) power for CIB/CRM handling. Our "worst performing nodes" (out of the 24 in the cluster) are Xeon E7-2830 @ 2.13GHz.
The main issue we currently face is when the DC is taken out and a new one must be elected: CPU goes to 100% for several tens of seconds (even minutes), during which the cluster is totally unresponsive. Fortunately, the resources themselves just sit tight and remain available (I can't say about those that would need to be migrated because they are colocated with the DC; we manually avoid that situation when performing maintenance that may affect the DC).
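
A minimal sketch of how a node can be put into standby before such maintenance, so that its resources move away beforehand ("node01" is just a placeholder name):

# Move resources off the node before taking it down:
crm_attribute --node node01 --name standby --update on
# ... perform maintenance, then allow it to host resources again:
crm_attribute --node node01 --name standby --update off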

I'm looking forward to migrating to corosync 2+ (there are some backports available for Debian/Jessie) and seeing if this would allow us to push the limit further. Unfortunately, I can't say for sure, as I have only a limited understanding of how Pacemaker/Corosync work and where CPU is bound to become a bottleneck.

'Hope it can help,

Cédric

On 04/11/15 23:26, Radoslaw Garbacz wrote:
> Thank you, will give it a try.
>
> On Wed, Nov 4, 2015 at 12:50 PM, Trevor Hemsley <themsley@voiceflex.com> wrote:
>
> On 04/11/15 18:41, Radoslaw Garbacz wrote:
> > Details:
> > OS: CentOS 6
> > Pacemaker: Pacemaker 1.1.9-1512.el6
> > Corosync: Corosync Cluster Engine, version '2.3.2'
>
> yum update
>
> Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
> major improvements in speed with later versions of pacemaker.
>
> Trevor
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
>
> --
> Best Regards,
>
> Radoslaw Garbacz
> XtremeData Incorporation
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org