Mailing List Archive

pacemaker error after a couple of weeks or months
Hello,

I have two active-passive failover systems with Corosync and DRBD.
One system uses two Debian servers and the other uses two Ubuntu servers.
The Debian servers are for web server failover and the Ubuntu servers are
for database server failover.

I applied the same Pacemaker configuration to both. Everything works fine:
failover happens nicely and so does the file system synchronization, but
the Ubuntu pair always runs into an error after a couple of weeks or
months. Pacemaker on ubuntu1 reported a different status than ubuntu2:
ubuntu1 assumed that ubuntu2 was down, while ubuntu2 assumed that something
had happened to ubuntu1 but that it was still alive, and it took over the
resources. As a result the DRBD resource could not be taken over, so no
failover happened, and we had to restart the server manually because
restarting Pacemaker and Corosync didn't help. I have changed the Pacemaker
configuration a couple of times, but the problem still exists.
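
(For reference, the split view can be confirmed on each node with commands
along these lines; this is only a rough sketch, not actual output from the
cluster:)

# compare the membership and resource view on both nodes
crm_mon -1
corosync-quorumtool -s

# DRBD connection and role state (DRBD 8.x as shipped with Ubuntu 14.04)
cat /proc/drbd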

Has anyone experienced this? I am using Ubuntu 14.04.1 LTS.

I got this error in apport.log:

ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable:
/usr/lib/pacemaker/lrmd (command line "/usr/lib/pacemaker/lrmd")
ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: is_closing_session():
no DBUS_SESSION_BUS_ADDRESS in environment
ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: wrote report
/var/crash/_usr_lib_pacemaker_lrmd.0.crash

my pacemaker configuration:

node $id="1" db \
attributes standby="off"
node $id="2" db2 \
attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="192.168.0.100" cidr_netmask="24" \
op monitor interval="30s"
primitive DBase ocf:heartbeat:mysql \
meta target-role="Started" \
op start timeout="120s" interval="0" \
op stop timeout="120s" interval="0" \
op monitor interval="20s" timeout="30s"
primitive DbFS ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/sync" fstype="ext4" \
op start timeout="60s" interval="0" \
op stop timeout="180s" interval="0" \
op monitor interval="60s" timeout="60s"
primitive Links lsb:drbdlinks
primitive r0 ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="29s" role="Master" \
op start timeout="240s" interval="0" \
op stop timeout="180s" interval="0" \
op promote timeout="180s" interval="0" \
op demote timeout="180s" interval="0" \
op monitor interval="30s" role="Slave"
group DbServer ClusterIP DbFS Links DBase
ms ms_r0 r0 \
meta master-max="1" master-node-max="1" clone-max="2" \
clone-node-max="1" notify="true" target-role="Master"
location prefer-db DbServer 50: db
colocation DbServer-with-ms_ro inf: DbServer ms_r0:Master
order DbServer-after-ms_ro inf: ms_r0:promote DbServer:start
property $id="cib-bootstrap-options" \
dc-version="1.1.10-42f2063" \
cluster-infrastructure="corosync" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1363370585"

my corosync config:

totem {
version: 2
token: 3000
token_retransmits_before_loss_const: 10
join: 60
consensus: 3600
vsftype: none
max_messages: 20
clear_node_high_bit: yes
secauth: off
threads: 0
rrp_mode: none
transport: udpu
cluster_name: Dbcluster
}

nodelist {
node {
ring0_addr: db
nodeid: 1
}
node {
ring0_addr: db2
nodeid: 2
}
}

quorum {
provider: corosync_votequorum
}

amf {
mode: disabled
}

service {
ver: 0
name: pacemaker
}

aisexec {
user: root
group: root
}

logging {
fileline: off
to_stderr: yes
to_logfile: yes
logfile: /var/log/corosync/corosync.log
to_syslog: no
syslog_facility: daemon
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
tags: enter|leave|trace1|trace2|trace3|trace4|trace6
}
}

my drbd.conf:

global {
usage-count no;
}

common {
protocol C;

handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh;
/usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ;
halt -f";
}

startup {
degr-wfc-timeout 120;
}

disk {
on-io-error detach;
}

syncer {
rate 100M;
al-extents 257;
}
}

resource r0 {
protocol C;
flexible-meta-disk internal;

on db2 {
address 192.168.0.10:7801;
device /dev/drbd0 minor 0;
disk /dev/sdb1;
}
on db {
device /dev/drbd0 minor 0;
disk /dev/db/sync;
address 192.168.0.20:7801;
}
handlers {
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
}
net {
after-sb-0pri discard-younger-primary;
#discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri call-pri-lost-after-sb;
}
}
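
(For reference, the documented manual recovery from a DRBD split brain is
roughly the following; this is only a sketch, and which node discards its
changes has to be decided case by case:)

# on the node whose changes are thrown away
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# on the surviving node, if it has dropped to StandAlone
drbdadm connect r0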

I have no idea how to solve this problem. Maybe someone can help me.

best regards,

ariee
Re: pacemaker error after a couple of weeks or months [ In reply to ]
----- Original Message -----
> Hello,
>
> I have two active-passive failover systems with Corosync and DRBD.
> One system uses two Debian servers and the other uses two Ubuntu servers.
> The Debian servers are for web server failover and the Ubuntu servers are
> for database server failover.
>
> I applied the same Pacemaker configuration to both. Everything works fine:
> failover happens nicely and so does the file system synchronization, but
> the Ubuntu pair always runs into an error after a couple of weeks or
> months. Pacemaker on ubuntu1 reported a different status than ubuntu2:
> ubuntu1 assumed that ubuntu2 was down, while ubuntu2 assumed that something
> had happened to ubuntu1 but that it was still alive, and it took over the
> resources. As a result the DRBD resource could not be taken over, so no
> failover happened, and we had to restart the server manually because
> restarting Pacemaker and Corosync didn't help. I have changed the Pacemaker
> configuration a couple of times, but the problem still exists.
>
> Has anyone experienced this? I am using Ubuntu 14.04.1 LTS.
>
> I got this error in apport.log:
>
> ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable:
> /usr/lib/pacemaker/lrmd (command line "/usr/lib/pacemaker/lrmd")

Wow, it looks like the lrmd is crashing on you. I haven't seen this occur
in the wild before. Without a backtrace it will be nearly impossible to determine
what is happening.
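
If the apport crash report is still around, a backtrace can usually be
pulled out of it along these lines (just a sketch, assuming the matching
debug symbols for pacemaker are installed):

# unpack the report; the target directory is only an example
apport-unpack /var/crash/_usr_lib_pacemaker_lrmd.0.crash /tmp/lrmd-crash

# open the core dump and print a full backtrace
gdb /usr/lib/pacemaker/lrmd /tmp/lrmd-crash/CoreDump
(gdb) bt full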

Do you have the ability to upgrade pacemaker to a newer version?

-- Vossel

> ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: is_closing_session(): no
> DBUS_SESSION_BUS_ADDRESS in environment
> ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: wrote report
> /var/crash/_usr_lib_pacemaker_lrmd.0.crash
>
> my pacemaker configuration:
>
> node $id="1" db \
> attributes standby="off"
> node $id="2" db2 \
> attributes standby="off"
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
> params ip="192.168.0.100" cidr_netmask="24" \
> op monitor interval="30s"
> primitive DBase ocf:heartbeat:mysql \
> meta target-role="Started" \
> op start timeout="120s" interval="0" \
> op stop timeout="120s" interval="0" \
> op monitor interval="20s" timeout="30s"
> primitive DbFS ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/sync" fstype="ext4" \
> op start timeout="60s" interval="0" \
> op stop timeout="180s" interval="0" \
> op monitor interval="60s" timeout="60s"
> primitive Links lsb:drbdlinks
> primitive r0 ocf:linbit:drbd \
> params drbd_resource="r0" \
> op monitor interval="29s" role="Master" \
> op start timeout="240s" interval="0" \
> op stop timeout="180s" interval="0" \
> op promote timeout="180s" interval="0" \
> op demote timeout="180s" interval="0" \
> op monitor interval="30s" role="Slave"
> group DbServer ClusterIP DbFS Links DBase
> ms ms_r0 r0 \
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" \
> notify="true" target-role="Master"
> location prefer-db DbServer 50: db
> colocation DbServer-with-ms_ro inf: DbServer ms_r0:Master
> order DbServer-after-ms_ro inf: ms_r0:promote DbServer:start
> property $id="cib-bootstrap-options" \
> dc-version="1.1.10-42f2063" \
> cluster-infrastructure="corosync" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1363370585"
>
> my corosync config:
>
> totem {
> version: 2
> token: 3000
> token_retransmits_before_loss_const: 10
> join: 60
> consensus: 3600
> vsftype: none
> max_messages: 20
> clear_node_high_bit: yes
> secauth: off
> threads: 0
> rrp_mode: none
> transport: udpu
> cluster_name: Dbcluster
> }
>
> nodelist {
> node {
> ring0_addr: db
> nodeid: 1
> }
> node {
> ring0_addr: db2
> nodeid: 2
> }
> }
>
> quorum {
> provider: corosync_votequorum
> }
>
> amf {
> mode: disabled
> }
>
> service {
> ver: 0
> name: pacemaker
> }
>
> aisexec {
> user: root
> group: root
> }
>
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> to_syslog: no
> syslog_facility: daemon
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> }
> }
>
> my drbd.conf:
>
> global {
> usage-count no;
> }
>
> common {
> protocol C;
>
> handlers {
> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
> reboot -f";
> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
> /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
> reboot -f";
> local-io-error "/usr/lib/drbd/notify-io-error.sh;
> /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ;
> halt -f";
> }
>
> startup {
> degr-wfc-timeout 120;
> }
>
> disk {
> on-io-error detach;
> }
>
> syncer {
> rate 100M;
> al-extents 257;
> }
> }
>
> resource r0 {
> protocol C;
> flexible-meta-disk internal;
>
> on db2 {
> address 192.168.0.10:7801 ;
> device /dev/drbd0 minor 0;
> disk /dev/sdb1;
> }
> on db {
> device /dev/drbd0 minor 0;
> disk /dev/db/sync;
> address 192.168.0.20:7801 ;
> }
> handlers {
> split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> }
> net {
> after-sb-0pri discard-younger-primary; #discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri call-pri-lost-after-sb;
> }
> }
>
> I have no idea how to solve this problem. Maybe someone can help me.
>
> best regards,
>
> ariee

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org