Mailing List Archive

Master Became Slave - Cluster unstable $$$
Hi all,

I am new to corosync/pacemaker and I have a 2-node "production" cluster
running corosync + pacemaker + DRBD.

Node1 = lws1h1.mydomain.com
Node2 = lws1h2.mydomain.com

Both nodes are online in a failover setup: services run only on the node
where DRBD resides, and the other node stays online to take over if Node1
fails.

These are the software versions:
corosync-2.3.0-1.el6.x86_64
drbd84-utils-8.4.2-1.el6.elrepo.x86_64
pacemaker-1.1.8-1.el6.x86_64
OS: CentOS 6.4, x86_64

The cluster is configured with quorum (I am not sure what that is).

A few days ago I placed one of the nodes in maintenance mode "after" services
had started going bad due to a problem. I don't remember the details of how I
moved/migrated the resources, but I usually use the LCMC GUI tool. I also
restarted corosync/pacemaker on the nodes in a fairly random order. :$

After that, Node1 became the slave and Node2 became the master!

Services are now stuck on Node2 and I can't migrate them to Node1, even by
force (I tried both the command-line tools and the LCMC tool).


More details/output:


*####################### Start ###############################*
[aalishe@lws1h1 ~]$ sudo crm_mon -Afro
Last updated: Sun Apr 6 15:25:52 2014
Last change: Sun Apr 6 14:16:15 2014 via crm_resource on
lws1h2.mydomain.com
Stack: corosync
Current DC: lws1h2.mydomain.com (2) - partition with quorum
Version: 1.1.8-1.el6-394e906
2 Nodes configured, unknown expected votes
10 Resources configured.


Online: [ lws1h1.mydomain.com lws1h2.mydomain.com ]

Full list of resources:

Resource Group: SuperMetaService
SuperFloatIP (ocf::heartbeat:IPaddr2): Started lws1h2.mydomain.com
SuperFs1 (ocf::heartbeat:Filesystem): Started lws1h2.mydomain.com
SuperFs2 (ocf::heartbeat:Filesystem): Started lws1h2.mydomain.com
SuperFs3 (ocf::heartbeat:Filesystem): Started lws1h2.mydomain.com
SuperFs4 (ocf::heartbeat:Filesystem): Started lws1h2.mydomain.com
Master/Slave Set: SuperDataClone [SuperData]
Masters: [ lws1h2.mydomain.com ]
Slaves: [ lws1h1.mydomain.com ]
SuperMetaSQL (ocf::mydomain:pgsql): Started lws1h2.mydomain.com
SuperGTS (ocf::mydomain:mmon): Started lws1h2.mydomain.com
SuperCQP (ocf::mydomain:mmon): Started lws1h2.mydomain.com

Node Attributes:
* Node lws1h1.mydomain.com:
* Node lws1h2.mydomain.com:
+ master-SuperData : 10000

Operations:
* Node lws1h2.mydomain.com:
SuperFs1: migration-threshold=1000000
+ (1241) start: rc=0 (ok)
SuperMetaSQL: migration-threshold=1000000
+ (1254) start: rc=0 (ok)
+ (1257) monitor: interval=30000ms rc=0 (ok)
SuperFloatIP: migration-threshold=1000000
+ (1236) start: rc=0 (ok)
+ (1239) monitor: interval=30000ms rc=0 (ok)
SuperData:0: migration-threshold=1000000
+ (957) probe: rc=0 (ok)
+ (1230) promot

*########################### End ###########################*



CRM Configuration
*####################### Start ###############################*
[aalishe@lws1h1 ~]$ sudo crm configure show
node $id="1" lws1h1.mydomain.com \
attributes standby="off"
node $id="2" lws1h2.mydomain.com \
attributes standby="off"
primitive SuperCQP ocf:mydomain:mmon \
params mmond="/opt/mydomain/platform/bin/"
cfgfile="/opt/mydomain/platform/etc/mmon_mydomain_cqp.xml"
pidfile="/opt/mydomain/platform/var/run/mmon_mydomain_cqp.pid"
user="mydomainsvc" db="bigdata" dbport="5434" \
operations $id="SuperCQP-operations" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120" \
op monitor interval="120" timeout="120" start-delay="0" \
meta target-role="started" is-managed="true"
primitive SuperData ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="60s" \
meta target-role="started"
primitive SuperFloatIP ocf:heartbeat:IPaddr2 \
params ip="10.100.0.225" cidr_netmask="24" \
op monitor interval="30s" \
meta target-role="started"
primitive SuperFs1 ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/mnt/drbd1" fstype="ext4" \
meta target-role="started"
primitive SuperFs2 ocf:heartbeat:Filesystem \
params device="/dev/drbd2" directory="/mnt/drbd2" fstype="ext4" \
meta target-role="started"
primitive SuperFs3 ocf:heartbeat:Filesystem \
params device="/dev/drbd3" directory="/mnt/drbd3" fstype="ext4"
primitive SuperFs4 ocf:heartbeat:Filesystem \
params device="/dev/drbd4" directory="/mnt/drbd4" fstype="ext4" \
meta target-role="started"
primitive SuperGTS ocf:mydomain:mmon \
params mmond="/opt/mydomain/platform/bin/"
cfgfile="/opt/mydomain/platform/etc/mmon_mydomain_gts.xml"
pidfile="/opt/mydomain/platform/var/run/mmon_mydomain_gts.pid" user=
"mydomainsvc" db="bigdata" \
operations $id="SuperGTS-operations" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120" \
op monitor interval="120" timeout="120" start-delay="0" \
meta target-role="started" is-managed="true"
primitive SuperMetaSQL ocf:mydomain:pgsql \
op monitor interval="30" timeout="30" depth="0" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120" \
params pgdata="/mnt/drbd1/pgsql/data" pgdb="bigdata" \
meta target-role="Started" is-managed="true"
group SuperMetaService SuperFloatIP SuperFs1 SuperFs2 SuperFs3 SuperFs4 \
meta target-role="Started"
ms SuperDataClone SuperData \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
colocation CqpOnGts inf: SuperCQP SuperGTS
colocation GtsOnMeta inf: SuperGTS SuperMetaSQL
colocation MetaSQLonData inf: SuperMetaSQL SuperDataClone:Master
colocation ServiceOnDrbd inf: SuperMetaService SuperDataClone:Master
order CqpAfterGts inf: SuperGTS:start SuperCQP
order GtsAfterMeta inf: SuperMetaSQL:start SuperGTS
order MetaAfterService inf: SuperMetaService:start SuperMetaSQL
order ServiceAfterDrbd inf: SuperDataClone:promote SuperMetaService:start
property $id="cib-bootstrap-options" \
dc-version="1.1.8-1.el6-394e906" \
cluster-infrastructure="corosync" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1396817730" \
maintenance-mode="false"
*######################## End ###########################*


Messages I saw the first time I migrated (or maybe un-migrated) services:

*############################Start ######################*
/usr/sbin/crm_resource -r SuperMetaService --migrate
WARNING: Creating rsc_location constraint 'cli-standby-SuperMetaService'
with a score of -INFINITY for resource SuperMetaService on
lws1h2.mydomain.com.
This will prevent SuperMetaService from running on lws1h2.mydomain.com
until the constraint is removed using the 'crm_resource -U' command or
manually with cibadmin
This will be the case even if lws1h2.mydomain.com is the last node in the
cluster
This message can be disabled with -Q

[aalishe@lws1h2.mydomain.com:~#] /usr/sbin/cibadmin --obj_type constraints
-C -X '<rsc_location id="cli-standby-SuperDataClone"
rsc="SuperDataClone"><rule id="cli-standby-SuperDataClone-rule"
score="-INFINITY" role="Master"><expression attribute="#uname"
id="cli-standby-SuperDataClone-expression" operation="eq"
value="lws1h2.mydomain.com"/></rule></rsc_location>'

[aalishe@lws1h2.mydomain.com:~#] /usr/sbin/crm_resource -r SuperMetaSQL
--migrate
Resource SuperMetaSQL not moved: not-active and no preferred location
specified.
Error performing operation: Invalid argument
*###############################End#############################*
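
The cli-* location constraints described in that warning are what later pin
resources (or the DRBD master role) away from a node until they are removed.
A minimal sketch of how they could be cleared, reusing only the command forms
already shown above (the constraint IDs are the ones from the warning and the
cibadmin call; not verified against this cluster):

# Clear the constraint that crm_resource --migrate created for the group:
/usr/sbin/crm_resource -r SuperMetaService -U

# Delete the manually created clone constraint by its id:
/usr/sbin/cibadmin -D -X '<rsc_location id="cli-standby-SuperDataClone"/>'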


Errors from System Logs

*############################# Start ###############################*
[aalishe@lws1h1 ~]$ sudo tail -n 300 /var/log/messages | grep -E
"error|warning"
Apr 6 14:16:14 lws1h1 cibmon[28024]: error: crm_element_value: Couldn't
find src in NULL
Apr 6 14:16:14 lws1h1 cibmon[28024]: error: crm_element_value: Couldn't
find src in NULL
Apr 6 14:16:15 lws1h1 cibmon[28024]: error: crm_element_value: Couldn't
find src in NULL
Apr 6 14:16:15 lws1h1 cibmon[28024]: error: crm_element_value: Couldn't
find src in NULL
Apr 6 14:16:15 lws1h1 kernel: crm_simulate[29878]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fffd9cff2b0 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:15 lws1h1 kernel: crm_simulate[29886]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff7f89d140 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:15 lws1h1 kernel: crm_simulate[29893]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fffef934290 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:15 lws1h1 kernel: crm_simulate[29900]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff7d4e0d90 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:51 lws1h1 cibmon[28024]: error: crm_element_value: Couldn't
find src in NULL
Apr 6 14:16:51 lws1h1 kernel: crm_simulate[31902]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff36787850 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:51 lws1h1 kernel: crm_simulate[31909]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff45ed4830 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:52 lws1h1 cibmon[28024]: error: crm_element_value: Couldn't
find src in NULL
Apr 6 14:16:52 lws1h1 kernel: crm_simulate[31917]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fffe68127a0 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:52 lws1h1 kernel: crm_simulate[31924]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff80548b20 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:55 lws1h1 cibmon[28024]: error: crm_element_value: Couldn't
find src in NULL
Apr 6 14:16:56 lws1h1 kernel: crm_simulate[31958]: segfault at 1d4c0 ip
0000003a2284812c sp 00007ffff103d430 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:56 lws1h1 kernel: crm_simulate[31965]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fffcb795e30 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr 6 14:16:57 lws1h1 cibmon[28024]: error: crm_element_value: Couldn't
find src in NULL
Apr 6 16:12:37 lws1h1 crm_simulate[29796]: error: crm_xml_err: XML
Error: I/O warning : failed to load external entity
"/tmp/lcmc-test-387474fb-b2c3-4c7c-96ed-f207635579e2.xml"
Apr 6 16:12:37 lws1h1 crm_simulate[29796]: error: filename2xml: Parsing
failed (domain=8, level=1, code=1549): failed to load external entity
"/tmp/lcmc-test-387474fb-b2c3-4c7c-96ed-f207635579e2.xml"
Apr 6 16:12:37 lws1h1 crm_simulate[29796]: error: filename2xml: Couldn't
parse /tmp/lcmc-test-387474fb-b2c3-4c7c-96ed-f207635579e2.xml
Apr 6 16:12:37 lws1h1 crm_simulate[29796]: error: crm_abort:
xpath_search: Triggered assert at xml.c:2742 : xml_top != NULL
Apr 6 16:12:37 lws1h1 crm_simulate[29796]: error: crm_element_value:
Couldn't find validate-with in NULL
Apr 6 16:12:37 lws1h1 crm_simulate[29796]: error: crm_abort:
update_validation: Triggered assert at xml.c:2586 : *xml_blob != NULL
Apr 6 16:12:37 lws1h1 crm_simulate[29796]: error: crm_element_value:
Couldn't find validate-with in NULL
Apr 6 16:21:24 lws1h1 attrd[32458]: warning: attrd_cib_callback: Update
fail-count-SuperCQP=(null) failed: Transport endpoint is not connected
Apr 6 16:21:24 lws1h1 attrd[32458]: warning: attrd_cib_callback: Update
last-failure-SuperCQP=(null) failed: Transport endpoint is not connected
Apr 6 16:26:11 lws1h1 attrd[32458]: warning: attrd_cib_callback: Update
fail-count-SuperCQP=(null) failed: Transport endpoint is not connected
Apr 6 16:26:11 lws1h1 crmd[15595]: warning: decode_transition_key: Bad
UUID (crm-resource-2879) in sscanf result (3) for 0:0:crm-resource-2879
Apr 6 16:26:11 lws1h1 crmd[15595]: error: send_msg_via_ipc: Unknown
Sub-system (2879_crm_resource)... discarding message.
Apr 6 16:38:59 lws1h1 attrd[32458]: warning: attrd_cib_callback: Update
fail-count-SuperGTS=(null) failed: Transport endpoint is not connected
Apr 6 16:38:59 lws1h1 attrd[32458]: warning: attrd_cib_callback: Update
last-failure-SuperGTS=(null) failed: Transport endpoint is not connected
*#####################################End###########################*



Thanks in advance for your help.

** Please note that I will reward whoever "first" helps me "solve" this
issue **



--
View this message in context: http://linux-ha.996297.n3.nabble.com/Master-Became-Slave-Cluster-unstable-tp15583.html
Sent from the Linux-HA mailing list archive at Nabble.com.
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: Master Became Slave - Cluster unstable $$$ [ In reply to ]
I can't speak to your specific problem, but I can say for certain that
you need to disable quorum[1] and enable stonith (also called
fencing[2]). Once stonith is configured (and tested) in pacemaker, be
sure to set up fencing in DRBD using the 'crm-fence-peer.sh' fence
handler[3].

digimer

1. https://alteeve.ca/w/Quorum
2. https://alteeve.ca/w/AN!Cluster_Tutorial_2#Concept.3B_Fencing
3.
https://alteeve.ca/w/AN!Cluster_Tutorial_2#Configuring_DRBD_Global_and_Common_Options
(replace 'rhcs_fence' with 'crm-fence-peer.sh')
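
A minimal sketch of what that DRBD-side configuration could look like for the
r0 resource used by this cluster (the handler paths are the usual drbd-utils
locations, and the fencing policy shown is a common choice rather than
anything taken from this thread):

resource r0 {
    disk {
        fencing resource-and-stonith;   # resource-only if STONITH is not in place yet
    }
    handlers {
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    ...
}

On the pacemaker side this also means setting stonith-enabled="true" once a
working stonith resource exists (the configuration above currently has
stonith-enabled="false" and no stonith resources).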

On 06/04/14 08:02 PM, aalishe wrote:
> [...]


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
Re: Master Became Slave - Cluster unstable $$$ [ In reply to ]
Tell me what information is still needed to be sure/certain of solving the
problem.

I can provide it all

thanks for your time



--
View this message in context: http://linux-ha.996297.n3.nabble.com/Master-Became-Slave-Cluster-unstable-tp15583p15585.html
Sent from the Linux-HA mailing list archive at Nabble.com.
Re: Master Became Slave - Cluster unstable $$$ [ In reply to ]
FIRST you need to set up fencing (STONITH) - I do not see any stonith
resource in your cluster - that WILL be a problem in your cluster.

You cannot "migrate" a Master/Slave resource. You should use "crm_master" to
score the master placement. And you should remove all of the client-prefer
location rules which you added by your "experiments" using the GUI; they
might hurt the cluster in the future...

AND, as already written in this thread, you must tell the cluster to
ignore quorum if you really only have a two-node cluster.
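
A rough sketch of what a pair of STONITH resources could look like in crm
shell syntax. Everything below is a placeholder example: the fence agent,
IPMI addresses and credentials are made up, and the right agent depends
entirely on the actual hardware:

# One fence device per node, each kept off the node it is meant to fence.
crm configure primitive st-lws1h1 stonith:fence_ipmilan \
    params ipaddr="192.0.2.11" login="admin" passwd="secret" \
           pcmk_host_list="lws1h1.mydomain.com" \
    op monitor interval="60s"
crm configure primitive st-lws1h2 stonith:fence_ipmilan \
    params ipaddr="192.0.2.12" login="admin" passwd="secret" \
           pcmk_host_list="lws1h2.mydomain.com" \
    op monitor interval="60s"
crm configure location st-lws1h1-not-on-lws1h1 st-lws1h1 -inf: lws1h1.mydomain.com
crm configure location st-lws1h2-not-on-lws1h2 st-lws1h2 -inf: lws1h2.mydomain.com
crm configure property stonith-enabled="true"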

FMaloja

On 04/07/2014 02:31 AM, aalishe wrote:
> [...]

Re: Master Became Slave - Cluster unstable $$$ [ In reply to ]
On 04/08/2014 12:18 AM, Ammar Sheikh Saleh wrote:
> yes, I have the command ... it's CentOS

Then please review the man page of crm_master and try to adjust the
scores where you want to start the master and where you want to start
the slave. Before you follow my general steps, you could also ask again
on the list about using crm_master from the command line on CentOS - I am
not really sure it behaves the same there.

1. Check the current promotion scores using the pengine:
ptest -Ls | grep promo
-> You should get a list of scores per master/slave resource and node

2. Check the set crm_master score using crm_master:
crm_master -q -G -N <node> -l reboot -r <resource-INSIDE-masterslave>

3. Adjust the master/promotion scores (this is the most tricky part)
crm_master -v <NEW_MASTER_VALUE> -l reboot -r <resource-INSIDE-masterslave>
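
Applied to the names from this thread, that could look roughly like the
following (a sketch only: SuperData is the primitive inside SuperDataClone,
and 20000 is an arbitrary value picked to be higher than the 10000 shown for
master-SuperData on lws1h2 in the crm_mon output above):

ptest -Ls | grep promo
crm_master -q -G -N lws1h1.mydomain.com -l reboot -r SuperData
crm_master -v 20000 -N lws1h1.mydomain.com -l reboot -r SuperData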

If you do not have constraints added by earlier bad operations, that
might help the cluster to promote the preferred site.

But my procedure comes without any warranty or further support, sorry.

Maloja01

>
>
> On Mon, Apr 7, 2014 at 4:16 PM, Maloja01 <maloja01@arcor.de> wrote:
>
>> On 04/07/2014 03:00 PM, Ammar Sheikh Saleh wrote:
>>
>>> thanks for your help ... can you guide me to the correct commands :
>>>
>>> I don't understand what <rsc> is in this command
>>>
>>> crm(live)node# attribute
>>> usage:
>>> attribute <node> set <rsc> <value>
>>> attribute <node> delete <rsc>
>>> attribute <node> show <rsc>
>>>
>>>
>>> how can I give a node a master attribute with high score in the above ?
>>>
>>
>> At SLES (SUSE) there is a command crm_master - do you have such a command?
>>
>>
>>
>>> cheers!
>>> Ammar
>>>
>>>
>>> On Mon, Apr 7, 2014 at 3:49 PM, Maloja01 <maloja01@arcor.de> wrote:
>>>
>>> On 04/07/2014 01:23 PM, Ammar Sheikh Saleh wrote:
>>>>
>>>> thanks a million time for answering ...
>>>>>
>>>>> I included all my software versions /OS details in the thread (
>>>>> http://linux-ha.996297.n3.nabble.com/Master-Became-
>>>>> Slave-Cluster-unstable-td15583.html)
>>>>>
>>>>>
>>>>> but here they are :
>>>>> this is the SW versions:
>>>>> corosync-2.3.0-1.el6.x86_64
>>>>> drbd84-utils-8.4.2-1.el6.elrepo.x86_64
>>>>> pacemaker-1.1.8-1.el6.x86_64
>>>>> OS: CentOS 6.4 x64bit
>>>>>
>>>>>
>>>> Ah, sorry, then I can't tell how stonith is working, as RH has a
>>>> completely different setup around pacemaker. I do not know how they
>>>> implement fencing.
>>>>
>>>> Sorry - but however best regards
>>>> F.Maloja
>>>>
>>>>
>>>>
>>>>
>>>>> I need to correct something: the setup is 2 nodes for HA and a third one
>>>>> for quorum only ... also the config changed a little bit (attached)
>>>>>
>>>>> looking at the config right now ... I see a suspicious line
>>>>>
>>>>> ( <rule role="Master" score="-INFINITY"
>>>>> id="drbd-fence-by-handler-r0-rule-SuperDataClone">
>>>>> <expression attribute="#uname" operation="ne" value="
>>>>> lws1h1.mydomain.com" id="drbd-fence-by-handler-r0-expr-SuperDataClone"/>
>>>>> )
>>>>>
>>>>> it might be the reason why services are not starting on the first node
>>>>> (not 100% sure) ... I have a feeling your answer is the right one, but I
>>>>> don't know the correct commands to do this:
>>>>>
>>>>> 1- I need to put Node1 back to Master (currently it is the slave)
>>>>> 2- remove any constraints or preferred locations ... also remove any
>>>>> special attributes on Node1 that are making it the slave
>>>>>
>>>>> What do you think? How can I do these? What are the commands?
>>>>>
>>>>>
>>>>> cheers!
>>>>> Ammar
>>>>>
>>>>>
>>>>> On Mon, Apr 7, 2014 at 2:09 PM, Maloja01 <maloja01@arcor.de> wrote:
>>>>>
>>>>> hi ammar,
>>>>>
>>>>>>
>>>>>> first we need to check:
>>>>>> a) which OS (which Linux dist are you using)?
>>>>>> b) which cluster version/packages do you have installed?
>>>>>> c) what does your cluster config look like?
>>>>>>
>>>>>> As my tip with crm_master/removing client-prefer rules might only help
>>>>>> in some combinations of a-b-c, I need that info.
>>>>>>
>>>>>> Second, I need to say that free help means I will not give any
>>>>>> warranty that your cluster gets better afterwards. That could only be
>>>>>> done through cost-intensive consulting.
>>>>>>
>>>>>> regards
>>>>>> f.maloja
>>>>>>
>>>>>>
>>>>>> On 04/07/2014 12:10 PM, aalishe@gmail.com wrote:
>>>>>>
>>>>>> hi Maloja01,
>>>>>>
>>>>>>>
>>>>>>> could you please help me do these steps / tell me what the commands look like:
>>>>>>>
>>>>>>> - You should use "crm_master" to score the master placement
>>>>>>> - you should remove all client-prefer location rules which you added
>>>>>>> by your "experiments" using the GUI
>>>>>>>
>>>>>>> I don't want to make this worse ... production is down here, and I am
>>>>>>> desperately in need of any help
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Ammar
>>>>>>>
>>>>>>> _____________________________________
>>>>>>> Sent from http://linux-ha.996297.n3.nabble.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
