Mailing List Archive

FW cluster fails at 4am
Hello all,

First, thanks in advance for any help anyone may provide. I've been battling
this problem off and on for months and it is driving me mad:

Once every week or two my cluster fails. For reasons unknown it seems to
initiate a failover and then the shorewall service (lsb) does not get started
(or is stopped). The majority of the time it happens just after 4am, although
it has happened at other times, much less frequently. Tonight I am
going to have to be up at 4am to poke around on the cluster and observe what is
happening, if anything.

One theory is some sort of resource starvation such as CPU but I've stress
tested it and run a backup and a big file copy through the firewall at the same
time and never get more than 1 core of cpu (almost all due to the backup) out
of 4 utilized and nothing interesting happening to pacemaker/resources.

My setup is a bit complicated in that I have 63 IPaddr2 resources plus the
shorewall resource, plus order and colocation rules to make sure it all sticks
together and the IPs come up before shorewall.

I am running the latest RHEL/CentOS RPMs in CentOS 6.5:

[root@new-fw1 shorewall]# rpm -qa |grep -i corosync
corosync-1.4.1-17.el6.x86_64
corosynclib-1.4.1-17.el6.x86_64
[root@new-fw1 shorewall]# rpm -qa |grep -i pacemaker
pacemaker-1.1.10-14.el6.x86_64
pacemaker-libs-1.1.10-14.el6.x86_64
pacemaker-cluster-libs-1.1.10-14.el6.x86_64
pacemaker-cli-1.1.10-14.el6.x86_64

I am a little concerned about how pacemaker manages the shorewall resource. It
usually fails to bring up shorewall after a failover event. Shorewall could
fail to start if the IP addresses shorewall is expecting to be on the
interfaces are not there yet. But I have dependencies to prevent this from ever
happening such as:

order shorewall-after-dmz-gw inf: dmz-gw shorewall
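
(For illustration only, and not something I have actually tried: the same
dependencies could in principle be expressed once against a group instead of
once per IP. The group name and the shortened member list here are just
placeholders:)

# untested sketch: put the VIPs in one group (only a few members shown),
# then constrain shorewall against the group rather than each address
group fw-vips dmz-gw actonomy corpsites dbrw mjhdev
colocation shorewall-with-vips inf: shorewall fw-vips
order shorewall-after-vips inf: fw-vips shorewall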

I also wonder if the shorewall init script is properly LSB compatible. It
wasn't out of the box and I had to make a minor change. But now it does seem to
be LSB compatible:

[root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:57:14 PST 2013

Shorewall is running
State:Started (Fri Dec 27 04:11:14 PST 2013) from /etc/shorewall/

result: 0
[root@new-fw2 ~]# /etc/init.d/shorewall stop ; echo "result: $?"
Shutting down shorewall: [ OK ]
result: 0
[root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:57:48 PST 2013

Shorewall is stopped
State:Stopped (Fri Dec 27 16:57:47 PST 2013)

result: 3
[root@new-fw2 ~]# /etc/init.d/shorewall start ; echo "result: $?"
Starting shorewall: Shorewall is already running
[ OK ]
result: 0
[root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:58:04 PST 2013

Shorewall is running
State:Started (Fri Dec 27 16:57:53 PST 2013) from /etc/shorewall/

result: 0

So it shouldn't be an LSB issue at this point...
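
(For anyone hitting the same thing: what is being tested above is essentially
what LSB requires of an init script, i.e. status returns 0 when running and 3
when stopped, and start/stop are idempotent and return 0. The wrapper below is
only a generic sketch of that mapping, not the actual change I made, and it
assumes "shorewall status" itself exits non-zero when the firewall is stopped:)

#!/bin/sh
# hypothetical LSB-style wrapper sketch, not the real /etc/init.d/shorewall edit
case "$1" in
  status)
    if shorewall status >/dev/null 2>&1; then
      exit 0    # LSB: 0 = service running
    else
      exit 3    # LSB: 3 = service not running
    fi
    ;;
  start|stop|restart)
    shorewall "$1"
    exit $?
    ;;
  *)
    echo "Usage: $0 {start|stop|restart|status}"
    exit 2      # LSB: 2 = invalid or excess arguments
    ;;
esac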

I have a very hard time making heads or tails of the
/var/log/cluster/corosync.log log files. For example, I just had this appear in the log files:

Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave spider2-eth0-40 (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave spider2-eth0-41 (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave corpsites (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave dbrw (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave mjhdev (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave datapass1-ssl-eth0-2 (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave datapass2-ssl-eth0-2 (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave datapass2-ssl-eth0-1 (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave datapass2-ssl-eth0 (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave rrdev2 (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave webmail-resumepromotion (Started new-fw2.mydomain.com)
Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: info: do_te_invoke: Processing graph 599 (ref=pe_calc-dc-1388202991-1545) derived from /var/lib/pacemaker/pengine/pe-input-894.bz2
Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: notice: run_graph: Transition 599 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-894.bz2): Complete
Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: notice: process_pe_message: Calculated Transition 599: /var/lib/pacemaker/pengine/pe-input-894.bz2

Is this normal? What does "Leave" mean? This is on the inactive node. The
active node doesn't seem to be making any log entries at all.

My corosync.conf and full crm config are as follows. There has also been
concern expressed about the size of our crm config. Is this unreasonable? I am
going to attempt to attach the 120k bzipped logfile to this email but odds are
the list won't take attachments. If it doesn't make it and anyone wants to see
it I will figure out somewhere else to stash it.

The logs are so voluminous that it is hard to know which parts are necessary.

corosync.conf:

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
version: 2
secauth: off
threads: 0
interface {
ringnumber: 0
bindnetaddr: 10.0.2.0
mcastaddr: 226.94.1.1
mcastport: 5000
member {
memberaddr: 10.0.2.83
}
member {
memberaddr: 10.0.2.84
}
}
transport udpu
}

logging {
fileline: off
to_stderr: no
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

amf {
mode: disabled
}


crm configure show:

node new-fw1.mydomain.com \
attributes standby="off"
node new-fw2.mydomain.com \
attributes standby="off"
primitive actonomy ocf:heartbeat:IPaddr2 \
params ip="206.71.189.165" nic="eth1" cidr_netmask="25" \
op monitor interval="30s" \
meta is-managed="true"
primitive corpsites ocf:heartbeat:IPaddr2 \
params ip="206.71.189.150" nic="eth1" cidr_netmask="25" \
op monitor interval="30s" \
meta target-role="Started"
primitive datapass1-ssl-eth0-2 ocf:heartbeat:IPaddr2 \
params ip="206.71.189.166" nic="eth1" cidr_netmask="25" \
op monitor interval="30s"
primitive datapass2-ssl-eth0 ocf:heartbeat:IPaddr2 \
params ip="206.71.189.161" nic="eth1" cidr_netmask="25" \
op monitor interval="30s"
primitive datapass2-ssl-eth0-1 ocf:heartbeat:IPaddr2 \
params ip="206.71.189.162" nic="eth1" cidr_netmask="25" \
op monitor interval="30s"
primitive datapass2-ssl-eth0-2 ocf:heartbeat:IPaddr2 \
params ip="206.71.189.167" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
primitive dbrw ocf:heartbeat:IPaddr2 \
params ip="206.71.189.250" nic="eth1" cidr_netmask="32" \
op monitor interval="30s" \
meta target-role="Started"
primitive dmz-gw ocf:heartbeat:IPaddr2 \
params ip="10.0.2.254" cidr_netmask="32" \
op monitor interval="30s"
primitive mjhdev ocf:heartbeat:IPaddr2 \
params ip="209.216.236.138" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
primitive mjhwiki ocf:heartbeat:IPaddr2 \
params ip="206.71.189.163" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
primitive mx1 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.65" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive pin1 ocf:heartbeat:IPaddr2 \
params ip="206.71.189.147" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
primitive pin2 ocf:heartbeat:IPaddr2 \
params ip="206.71.189.146" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
primitive pin3 ocf:heartbeat:IPaddr2 \
params ip="206.71.189.145" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
primitive rabbit ocf:heartbeat:IPaddr2 \
params ip="206.71.189.149" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
primitive reza ocf:heartbeat:IPaddr2 \
params ip="216.240.176.66" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive rrdev ocf:heartbeat:IPaddr2 \
params ip="206.71.189.164" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
primitive rrdev2 ocf:heartbeat:IPaddr2 \
params ip="206.71.189.240" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
primitive shorewall lsb:shorewall \
op monitor interval="60s" \
meta target-role="Started" is-managed="true"
primitive spider2-eth0-1 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.67" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-10 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.76" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-11 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.77" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-12 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.78" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-13 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.79" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-14 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.80" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-15 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.81" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-16 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.82" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-17 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.83" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-18 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.84" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-19 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.85" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-2 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.68" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-20 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.86" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-21 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.87" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-22 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.88" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-23 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.89" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-24 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.90" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-25 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.91" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-26 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.92" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-27 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.93" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-28 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.94" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-29 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.95" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-3 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.69" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-30 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.96" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-31 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.97" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-32 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.98" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-33 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.99" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-34 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.100" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-35 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.101" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-36 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.102" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-37 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.103" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-38 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.104" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-39 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.105" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-4 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.70" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-40 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.106" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-41 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.107" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-5 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.71" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-6 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.72" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-7 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.73" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-8 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.74" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive spider2-eth0-9 ocf:heartbeat:IPaddr2 \
params ip="216.240.176.75" nic="eth2" cidr_netmask="32" \
op monitor interval="30s"
primitive sugar ocf:heartbeat:IPaddr2 \
params ip="206.71.189.130" nic="eth1" cidr_netmask="25" \
op monitor interval="30s"
primitive webmail ocf:heartbeat:IPaddr2 \
params ip="206.71.189.254" nic="eth1" cidr_netmask="25" \
op monitor interval="30s"
primitive webmail-resumepromotion ocf:heartbeat:IPaddr2 \
params ip="206.71.189.242" nic="eth1" cidr_netmask="25" \
op monitor interval="30s"
primitive wiki ocf:heartbeat:IPaddr2 \
params ip="209.216.236.135" nic="eth1" cidr_netmask="32" \
op monitor interval="30s"
colocation dmz-gw-with-actonomy inf: dmz-gw actonomy
colocation dmz-gw-with-mjhwiki inf: dmz-gw mjhwiki
colocation dmz-gw-with-mx1 inf: dmz-gw mx1
colocation dmz-gw-with-pin1 inf: dmz-gw pin1
colocation dmz-gw-with-pin2 inf: dmz-gw pin2
colocation dmz-gw-with-pin3 inf: dmz-gw pin3
colocation dmz-gw-with-rabbit inf: dmz-gw rabbit
colocation dmz-gw-with-reza inf: dmz-gw reza
colocation dmz-gw-with-rrdev inf: dmz-gw rrdev
colocation dmz-gw-with-spider2-eth0-1 inf: dmz-gw spider2-eth0-1
colocation dmz-gw-with-spider2-eth0-10 inf: dmz-gw spider2-eth0-10
colocation dmz-gw-with-spider2-eth0-11 inf: dmz-gw spider2-eth0-11
colocation dmz-gw-with-spider2-eth0-12 inf: dmz-gw spider2-eth0-12
colocation dmz-gw-with-spider2-eth0-13 inf: dmz-gw spider2-eth0-13
colocation dmz-gw-with-spider2-eth0-14 inf: dmz-gw spider2-eth0-14
colocation dmz-gw-with-spider2-eth0-15 inf: dmz-gw spider2-eth0-15
colocation dmz-gw-with-spider2-eth0-16 inf: dmz-gw spider2-eth0-16
colocation dmz-gw-with-spider2-eth0-17 inf: dmz-gw spider2-eth0-17
colocation dmz-gw-with-spider2-eth0-18 inf: dmz-gw spider2-eth0-18
colocation dmz-gw-with-spider2-eth0-19 inf: dmz-gw spider2-eth0-19
colocation dmz-gw-with-spider2-eth0-2 inf: dmz-gw spider2-eth0-2
colocation dmz-gw-with-spider2-eth0-20 inf: dmz-gw spider2-eth0-20
colocation dmz-gw-with-spider2-eth0-21 inf: dmz-gw spider2-eth0-21
colocation dmz-gw-with-spider2-eth0-22 inf: dmz-gw spider2-eth0-22
colocation dmz-gw-with-spider2-eth0-23 inf: dmz-gw spider2-eth0-23
colocation dmz-gw-with-spider2-eth0-24 inf: dmz-gw spider2-eth0-24
colocation dmz-gw-with-spider2-eth0-25 inf: dmz-gw spider2-eth0-25
colocation dmz-gw-with-spider2-eth0-26 inf: dmz-gw spider2-eth0-26
colocation dmz-gw-with-spider2-eth0-27 inf: dmz-gw spider2-eth0-27
colocation dmz-gw-with-spider2-eth0-28 inf: dmz-gw spider2-eth0-28
colocation dmz-gw-with-spider2-eth0-29 inf: dmz-gw spider2-eth0-29
colocation dmz-gw-with-spider2-eth0-3 inf: dmz-gw spider2-eth0-3
colocation dmz-gw-with-spider2-eth0-30 inf: dmz-gw spider2-eth0-30
colocation dmz-gw-with-spider2-eth0-31 inf: dmz-gw spider2-eth0-31
colocation dmz-gw-with-spider2-eth0-32 inf: dmz-gw spider2-eth0-32
colocation dmz-gw-with-spider2-eth0-33 inf: dmz-gw spider2-eth0-33
colocation dmz-gw-with-spider2-eth0-34 inf: dmz-gw spider2-eth0-34
colocation dmz-gw-with-spider2-eth0-35 inf: dmz-gw spider2-eth0-35
colocation dmz-gw-with-spider2-eth0-36 inf: dmz-gw spider2-eth0-36
colocation dmz-gw-with-spider2-eth0-37 inf: dmz-gw spider2-eth0-37
colocation dmz-gw-with-spider2-eth0-38 inf: dmz-gw spider2-eth0-38
colocation dmz-gw-with-spider2-eth0-39 inf: dmz-gw spider2-eth0-39
colocation dmz-gw-with-spider2-eth0-4 inf: dmz-gw spider2-eth0-4
colocation dmz-gw-with-spider2-eth0-40 inf: dmz-gw spider2-eth0-40
colocation dmz-gw-with-spider2-eth0-41 inf: dmz-gw spider2-eth0-41
colocation dmz-gw-with-spider2-eth0-5 inf: dmz-gw spider2-eth0-5
colocation dmz-gw-with-spider2-eth0-6 inf: dmz-gw spider2-eth0-6
colocation dmz-gw-with-spider2-eth0-7 inf: dmz-gw spider2-eth0-7
colocation dmz-gw-with-spider2-eth0-8 inf: dmz-gw spider2-eth0-8
colocation dmz-gw-with-spider2-eth0-9 inf: dmz-gw spider2-eth0-9
colocation dmz-gw-with-sugar inf: dmz-gw sugar
colocation dmz-gw-with-webmail inf: dmz-gw webmail
colocation dmz-gw-with-wiki inf: dmz-gw wiki
colocation shorewall-with-actonomy inf: shorewall actonomy
colocation shorewall-with-corpsites inf: shorewall corpsites
colocation shorewall-with-datapass1-ssl-eth0-2 inf: shorewall datapass1-ssl-eth0-2
colocation shorewall-with-datapass2-ssl-eth0 inf: shorewall datapass2-ssl-eth0
colocation shorewall-with-datapass2-ssl-eth0-1 inf: shorewall datapass2-ssl-eth0-1
colocation shorewall-with-datapass2-ssl-eth0-2 inf: shorewall datapass2-ssl-eth0-2
colocation shorewall-with-dbrw inf: shorewall dbrw
colocation shorewall-with-dmz-gw inf: shorewall dmz-gw
colocation shorewall-with-mjhdev inf: shorewall mjhdev
colocation shorewall-with-mjhwiki inf: shorewall mjhwiki
colocation shorewall-with-mx1 inf: shorewall mx1
colocation shorewall-with-pin1 inf: shorewall pin1
colocation shorewall-with-pin2 inf: shorewall pin2
colocation shorewall-with-pin3 inf: shorewall pin3
colocation shorewall-with-rabbit inf: shorewall rabbit
colocation shorewall-with-reza inf: shorewall reza
colocation shorewall-with-rrdev inf: shorewall rrdev
colocation shorewall-with-rrdev2 inf: shorewall rrdev2
colocation shorewall-with-spider2-eth0-1 inf: shorewall spider2-eth0-1
colocation shorewall-with-spider2-eth0-10 inf: shorewall spider2-eth0-10
colocation shorewall-with-spider2-eth0-11 inf: shorewall spider2-eth0-11
colocation shorewall-with-spider2-eth0-12 inf: shorewall spider2-eth0-12
colocation shorewall-with-spider2-eth0-13 inf: shorewall spider2-eth0-13
colocation shorewall-with-spider2-eth0-14 inf: shorewall spider2-eth0-14
colocation shorewall-with-spider2-eth0-15 inf: shorewall spider2-eth0-15
colocation shorewall-with-spider2-eth0-16 inf: shorewall spider2-eth0-16
colocation shorewall-with-spider2-eth0-17 inf: shorewall spider2-eth0-17
colocation shorewall-with-spider2-eth0-18 inf: shorewall spider2-eth0-18
colocation shorewall-with-spider2-eth0-19 inf: shorewall spider2-eth0-19
colocation shorewall-with-spider2-eth0-2 inf: shorewall spider2-eth0-2
colocation shorewall-with-spider2-eth0-20 inf: shorewall spider2-eth0-20
colocation shorewall-with-spider2-eth0-21 inf: shorewall spider2-eth0-21
colocation shorewall-with-spider2-eth0-22 inf: shorewall spider2-eth0-22
colocation shorewall-with-spider2-eth0-23 inf: shorewall spider2-eth0-23
colocation shorewall-with-spider2-eth0-24 inf: shorewall spider2-eth0-24
colocation shorewall-with-spider2-eth0-25 inf: shorewall spider2-eth0-25
colocation shorewall-with-spider2-eth0-26 inf: shorewall spider2-eth0-26
colocation shorewall-with-spider2-eth0-27 inf: shorewall spider2-eth0-27
colocation shorewall-with-spider2-eth0-28 inf: shorewall spider2-eth0-28
colocation shorewall-with-spider2-eth0-29 inf: shorewall spider2-eth0-29
colocation shorewall-with-spider2-eth0-3 inf: shorewall spider2-eth0-3
colocation shorewall-with-spider2-eth0-30 inf: shorewall spider2-eth0-30
colocation shorewall-with-spider2-eth0-31 inf: shorewall spider2-eth0-31
colocation shorewall-with-spider2-eth0-32 inf: shorewall spider2-eth0-32
colocation shorewall-with-spider2-eth0-33 inf: shorewall spider2-eth0-33
colocation shorewall-with-spider2-eth0-34 inf: shorewall spider2-eth0-34
colocation shorewall-with-spider2-eth0-35 inf: shorewall spider2-eth0-35
colocation shorewall-with-spider2-eth0-36 inf: shorewall spider2-eth0-36
colocation shorewall-with-spider2-eth0-37 inf: shorewall spider2-eth0-37
colocation shorewall-with-spider2-eth0-38 inf: shorewall spider2-eth0-38
colocation shorewall-with-spider2-eth0-39 inf: shorewall spider2-eth0-39
colocation shorewall-with-spider2-eth0-4 inf: shorewall spider2-eth0-4
colocation shorewall-with-spider2-eth0-40 inf: shorewall spider2-eth0-40
colocation shorewall-with-spider2-eth0-41 inf: shorewall spider2-eth0-41
colocation shorewall-with-spider2-eth0-5 inf: shorewall spider2-eth0-5
colocation shorewall-with-spider2-eth0-6 inf: shorewall spider2-eth0-6
colocation shorewall-with-spider2-eth0-7 inf: shorewall spider2-eth0-7
colocation shorewall-with-spider2-eth0-8 inf: shorewall spider2-eth0-8
colocation shorewall-with-spider2-eth0-9 inf: shorewall spider2-eth0-9
colocation shorewall-with-sugar inf: shorewall sugar
colocation shorewall-with-webmail inf: shorewall webmail
colocation shorewall-with-webmail-resumepromotion inf: shorewall webmail-resumepromotion
colocation shorewall-with-wiki inf: shorewall wiki
order shorewall-after-actonomy inf: actonomy shorewall
order shorewall-after-corpsites inf: corpsites shorewall
order shorewall-after-datapass1-ssl-eth0-2 inf: datapass1-ssl-eth0-2 shorewall
order shorewall-after-datapass2-ssl-eth0 inf: datapass2-ssl-eth0 shorewall
order shorewall-after-datapass2-ssl-eth0-1 inf: datapass2-ssl-eth0-1 shorewall
order shorewall-after-datapass2-ssl-eth0-2 inf: datapass2-ssl-eth0-2 shorewall
order shorewall-after-dbrw inf: sugar dbrw
order shorewall-after-dmz-gw inf: dmz-gw shorewall
order shorewall-after-mjhdev inf: mjhdev shorewall
order shorewall-after-mjhwiki inf: mjhwiki shorewall
order shorewall-after-mx1 inf: mx1 shorewall
order shorewall-after-pin1 inf: pin1 shorewall
order shorewall-after-pin2 inf: pin2 shorewall
order shorewall-after-pin3 inf: pin3 shorewall
order shorewall-after-rabbit inf: rabbit shorewall
order shorewall-after-reza inf: reza shorewall
order shorewall-after-rrdev inf: rrdev shorewall
order shorewall-after-rrdev2 inf: rrdev2 shorewall
order shorewall-after-spider2-eth0-1 inf: spider2-eth0-1 shorewall
order shorewall-after-spider2-eth0-10 inf: spider2-eth0-10 shorewall
order shorewall-after-spider2-eth0-11 inf: spider2-eth0-11 shorewall
order shorewall-after-spider2-eth0-12 inf: spider2-eth0-12 shorewall
order shorewall-after-spider2-eth0-13 inf: spider2-eth0-13 shorewall
order shorewall-after-spider2-eth0-14 inf: spider2-eth0-14 shorewall
order shorewall-after-spider2-eth0-15 inf: spider2-eth0-15 shorewall
order shorewall-after-spider2-eth0-16 inf: spider2-eth0-16 shorewall
order shorewall-after-spider2-eth0-17 inf: spider2-eth0-17 shorewall
order shorewall-after-spider2-eth0-18 inf: spider2-eth0-18 shorewall
order shorewall-after-spider2-eth0-19 inf: spider2-eth0-19 shorewall
order shorewall-after-spider2-eth0-2 inf: spider2-eth0-2 shorewall
order shorewall-after-spider2-eth0-20 inf: spider2-eth0-20 shorewall
order shorewall-after-spider2-eth0-21 inf: spider2-eth0-21 shorewall
order shorewall-after-spider2-eth0-22 inf: spider2-eth0-22 shorewall
order shorewall-after-spider2-eth0-23 inf: spider2-eth0-23 shorewall
order shorewall-after-spider2-eth0-24 inf: spider2-eth0-24 shorewall
order shorewall-after-spider2-eth0-25 inf: spider2-eth0-25 shorewall
order shorewall-after-spider2-eth0-26 inf: spider2-eth0-26 shorewall
order shorewall-after-spider2-eth0-27 inf: spider2-eth0-27 shorewall
order shorewall-after-spider2-eth0-28 inf: spider2-eth0-28 shorewall
order shorewall-after-spider2-eth0-29 inf: spider2-eth0-29 shorewall
order shorewall-after-spider2-eth0-3 inf: spider2-eth0-3 shorewall
order shorewall-after-spider2-eth0-30 inf: spider2-eth0-30 shorewall
order shorewall-after-spider2-eth0-31 inf: spider2-eth0-31 shorewall
order shorewall-after-spider2-eth0-32 inf: spider2-eth0-32 shorewall
order shorewall-after-spider2-eth0-33 inf: spider2-eth0-33 shorewall
order shorewall-after-spider2-eth0-34 inf: spider2-eth0-34 shorewall
order shorewall-after-spider2-eth0-35 inf: spider2-eth0-35 shorewall
order shorewall-after-spider2-eth0-36 inf: spider2-eth0-36 shorewall
order shorewall-after-spider2-eth0-37 inf: spider2-eth0-37 shorewall
order shorewall-after-spider2-eth0-38 inf: spider2-eth0-38 shorewall
order shorewall-after-spider2-eth0-39 inf: spider2-eth0-39 shorewall
order shorewall-after-spider2-eth0-4 inf: spider2-eth0-4 shorewall
order shorewall-after-spider2-eth0-40 inf: spider2-eth0-40 shorewall
order shorewall-after-spider2-eth0-41 inf: spider2-eth0-41 shorewall
order shorewall-after-spider2-eth0-5 inf: spider2-eth0-5 shorewall
order shorewall-after-spider2-eth0-6 inf: spider2-eth0-6 shorewall
order shorewall-after-spider2-eth0-7 inf: spider2-eth0-7 shorewall
order shorewall-after-spider2-eth0-8 inf: spider2-eth0-8 shorewall
order shorewall-after-spider2-eth0-9 inf: spider2-eth0-9 shorewall
order shorewall-after-sugar inf: sugar shorewall
order shorewall-after-webmail inf: webmail shorewall
order shorewall-after-webmail-resumepromotion inf: webmail-resumepromotion shorewall
order shorewall-after-wiki inf: wiki shorewall
property $id="cib-bootstrap-options" \
dc-version="1.1.10-14.el6-368c726" \
cluster-infrastructure="classic openais (with plugin)" \
expected-quorum-votes="2" \
stonith-enabled="false" \
last-lrm-refresh="1388199388" \
no-quorum-policy="ignore" \
maintenance-mode="false"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"

--
Tracy Reed
Re: FW cluster fails at 4am
On 2013-12-28 04:34, Tracy Reed wrote:
> First, thanks in advance for any help anyone may provide. I've been battling
> this problem off and on for months and it is driving me mad:
>
> Once every week or two my cluster fails. For reasons unknown it seems to
> initiate a failover and then the shorewall service (lsb) does not get started
> (or is stopped). The majority of the time it happens just after 4am, although
> it has happened at other times, much less frequently. Tonight I am
> going to have to be up at 4am to poke around on the cluster and observe what is
> happening, if anything.
[snip]

Log rotation tends to run around that time on Red Hat. Check your
logrotate configuration. Maybe something is rotating corosync logs and
using the wrong signal to start a new log file.

Or, if not that, could it be some other cronned task?
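
For example, a stanza of roughly this shape (purely illustrative, not taken
from your system) is the kind of thing to look for; a postrotate that signals
or restarts the daemon is risky, whereas copytruncate rotates the file without
touching the process at all:

/var/log/cluster/*.log {
    weekly
    missingok
    compress
    # risky pattern: kicking the daemon on every rotation
    #postrotate
    #    /etc/init.d/corosync restart > /dev/null 2>&1 || true
    #endscript
    # safer for logs the daemon keeps open:
    copytruncate
}
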
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: FW cluster fails at 4am
On Fri, Dec 27, 2013 at 08:54:17PM PST, Jefferson Ogata spake thusly:
> Log rotation tends to run around that time on Red Hat. Check your logrotate
> configuration. Maybe something is rotating corosync logs and using the wrong
> signal to start a new log file.

That was actually the first thing I looked at! I found
/etc/logrotate.d/shorewall and removed it. But that seems to have had no effect
on the problem. That file has been gone for 3 weeks, the machines rebooted (not
that it should matter), and the problem has happened several times since then.

I've searched all over and can't find anything. And it doesn't even happen
every morning, just every week or two. Hard to nail down a real pattern other
than usually (not always) 4am.

> Or, if not that, could it be some other cronned task?

These firewall machines are standard CentOS boxes. The stock crons (logrotate
etc) and a 5 minute nagios passive check are the only things on them as far as
I can tell, although I haven't quite figured out what causes logrotate to run
at 4am. I know the job is /etc/cron.daily/logrotate, but what runs that at
4am? Is 4am some special hard-coded time in crond?
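
(If I'm reading a stock CentOS 6 box correctly, nothing is hard-coded to 4am
in crond itself; /etc/cron.daily is driven by anacron, roughly as below, so
the daily jobs land somewhere after 3am depending on the random delay. These
paths and values are from a default install and are worth verifying locally:)

# /etc/cron.d/0hourly -- cron proper only fires the hourly jobs
01 * * * * root run-parts /etc/cron.hourly

# /etc/cron.hourly/0anacron then invokes anacron, and /etc/anacrontab
# decides when the daily set actually runs:
RANDOM_DELAY=45          # up to 45 minutes of random delay
START_HOURS_RANGE=3-22   # daily jobs may start from 03:00
1   5   cron.daily   nice run-parts /etc/cron.daily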

I just noticed that there is an /etc/logrotate.d/cman which rotates
/var/log/cluster/*log. Could this somehow be an issue? I'm running pacemaker and
corosync but I'm not running cman:

# /etc/init.d/cman status
cman is not running

Should I be? I don't think it is necessary for this particular kind of
cluster... But since it isn't running it shouldn't matter.

Oddly, I just noticed this in my tail -f of the logs (no idea what triggered
it, but I did run /etc/init.d/cman status on the other node) which actually
mentions cman:

# Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_client_new: Connecting 0x2485d00 for uid=0 gid=0 pid=11103 id=b507c867-cbde-4508-8813-1439720f9c6b
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crm_mon/2, version=0.391.5)
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_compress_string: Compressed 201733 bytes into 12221 (ratio 16:1) in 69ms
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_client_destroy: Destroying 0 events
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_client_new: Connecting 0x2485d00 for uid=0 gid=0 pid=11105 id=7e95fe48-ca73-4f29-b2b7-e43596fab588
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/cibadmin/2, version=0.391.5)
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_compress_string: Compressed 201732 bytes into 12220 (ratio 16:1) in 61ms
Dec 27 21:55:05 [1541] new-fw2.mydomain.com cib: info: crm_client_destroy: Destroying 0 events
Set r/w permissions for uid=0, gid=0 on /var/log/cluster/corosync.log
Dec 27 21:55:07 corosync [pcmk ] info: process_ais_conf: Reading configure
Dec 27 21:55:07 corosync [pcmk ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Dec 27 21:55:07 corosync [pcmk ] ERROR: process_ais_conf: Please see Chapter 8 of 'Clusters from Scratch' (http://www.clusterlabs.org/doc) for details on using Pacemaker with CMAN
Dec 27 21:55:07 corosync [pcmk ] info: config_find_init: Local handle: 7178156903111852040 for logging
Dec 27 21:55:07 corosync [pcmk ] info: config_find_next: Processing additional logging options...
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found 'off' for option: debug
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_logfile
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found '/var/log/cluster/corosync.log' for option: logfile
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
Dec 27 21:55:07 corosync [pcmk ] info: config_find_init: Local handle: 5773499849093677065 for quorum
Dec 27 21:55:07 corosync [pcmk ] info: config_find_next: No additional configuration supplied for: quorum
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: No default for option: provider
Dec 27 21:55:07 corosync [pcmk ] info: config_find_init: Local handle: 7711695921217536010 for service
Dec 27 21:55:07 corosync [pcmk ] info: config_find_next: Processing additional service options...
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Found '1' for option: ver
Dec 27 21:55:07 corosync [pcmk ] info: process_ais_conf: Enabling MCP mode: Use the Pacemaker init script to complete Pacemaker startup
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
Dec 27 21:55:07 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
Dec 27 21:55:07 corosync [pcmk ] info: pcmk_exec_dump: Local id: 1409417226, uname: new-fw2.mydomain.com, born: 4480
Dec 27 21:55:07 corosync [pcmk ] info: pcmk_exec_dump: Membership id: 4480, quorate: true, expected: 2, actual: 2
Dec 27 21:55:07 corosync [pcmk ] info: member_dump_fn: node id:1409417226, uname=new-fw2.mydomain.com state=member processes=0000000000000000 born=4480 seen=4480 addr=r(0) ip(10.0.2.84) version=1.1.10-14.el6
Dec 27 21:55:07 corosync [pcmk ] info: member_dump_fn: node id:1392640010, uname=new-fw1.mydomain.com state=member processes=0000000000000000 born=4472 seen=4480 addr=r(0) ip(10.0.2.83) version=1.1.10-14.el6
Dec 27 21:55:07 [1541] new-fw2.mydomain.com cib: info: crm_client_new: Connecting 0x2485d00 for uid=0 gid=0 pid=11283 id=09833e1c-00d3-45c4-8ea1-c019f7dc2a4b
Dec 27 21:55:07 [1541] new-fw2.mydomain.com cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crm_mon/2, version=0.391.5)
Dec 27 21:55:07 [1541] new-fw2.mydomain.com cib: info: crm_compress_string: Compressed 201733 bytes into 12230 (ratio 16:1) in 62ms
Dec 27 21:55:07 [1541] new-fw2.mydomain.com cib: info: crm_client_destroy: Destroying 0 events

I don't see this directly relating to my problem but who knows...

--
Tracy Reed
Re: FW cluster fails at 4am
On 2013-12-28 06:13, Tracy Reed wrote:
> On Fri, Dec 27, 2013 at 08:54:17PM PST, Jefferson Ogata spake thusly:
>> Log rotation tends to run around that time on Red Hat. Check your logrotate
>> configuration. Maybe something is rotating corosync logs and using the wrong
>> signal to start a new log file.
>
> That was actually the first thing I looked at! I found
> /etc/logrotate.d/shorewall and removed it. But that seems to have had no effect
> on the problem. That file has been gone for 3 weeks, the machines rebooted (not
> that it should matter), and the problem has happened several times since then.
>
> I've searched all over and can't find anything. And it doesn't even happen
> every morning, just every week or two. Hard to nail down a real pattern other
> than usually (not always) 4am.

Is it possible that it's a coincidence of log rotation after patching?
In certain circumstances I've had library replacement or subsequent
prelink activity on libraries lead to a crash of some services during
log rotation. This hasn't happened to me with pacemaker/cman/corosync,
but it might conceivably explain why it only happens to you once in a while.

You might take a look at the pacct data in /var/account/ for the time of
the crash; it should indicate exit status for the dying process as well
as what other processes were started around the same time.

>> Or, if not that, could it be some other cronned task?
>
> These firewall machines are standard CentOS boxes. The stock crons (logrotate
> etc) and a 5 minute nagios passive check are the only things on them as far as
> I can tell, although I haven't quite figured out what causes logrotate to run
> at 4am. I know the job is /etc/cron.daily/logrotate, but what runs that at
> 4am? Is 4am some special hard-coded time in crond?
>
> I just noticed that there is an /etc/logrotate.d/cman which rotates
> /var/log/cluster/*log. Could this somehow be an issue? I'm running pacemaker and
> corosync but I'm not running cman:
>
> # /etc/init.d/cman status
> cman is not running
>
> Should I be? I don't think it is necessary for this particular kind of
> cluster... But since it isn't running it shouldn't matter.

Yes, you're supposed to switch to cman. Not sure if it's related to your
problem, tho.
Re: FW cluster fails at 4am
On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:
> Is it possible that it's a coincidence of log rotation after patching? In
> certain circumstances I've had library replacement or subsequent prelink
> activity on libraries lead to a crash of some services during log rotation.
> This hasn't happened to me with pacemaker/cman/corosync, but it might
> conceivably explain why it only happens to you once in a while.

I just caught the cluster in the middle of crashing again and noticed it had a
system load of 9, although it isn't clear why. A backup was running, but after
the cluster failed over the backup continued and the load went to very nearly
zero. So it doesn't seem like the backup was causing the issue. But the system's
performance was noticeably impacted. I've never noticed this situation before.

One thing I really need to learn more about is how the cluster knows when
something has failed and it needs to fail over. I first set up a Linux-HA
firewall cluster back around 2001 and we used simple heartbeat and some scripts
to pass around IP addresses and start/stop the firewall. It would ping its
upstream gateway and communicate with its partner via a serial cable. If the
active node couldn't ping its upstream it killed the local heartbeat and the
partner took over. If the active node wasn't sending heartbeats the passive
node took over. Once working it stayed working and was much much simpler than
the current arrangement.

I have no idea how the current system actually communicates or what the
criteria for failover really are.

What are the chances that the box gets overloaded and drops a packet and the
partner takes over?
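
(My current understanding, which may well be wrong: corosync decides
membership by circulating a totem token, so if the active node is too
overloaded to handle the token within the configured timeout, the partner
declares it dead and takes over even though nothing has really failed. If
that is what is happening here, loosening the totem timing is supposed to
help; the values below are only an example of the knobs involved, added
inside the existing totem { } block, not a recommendation:)

# ms to wait for the token before declaring loss (the default is much lower)
token: 10000
token_retransmits_before_loss_const: 10
# consensus must be larger than token (1.2 x token is the usual rule)
consensus: 12000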

What if I had an IP conflict with another box on the network and one of my VIP
IP addresses didn't behave as expected?

What would any of these look like in the logs? One of my biggest difficulties
in diagnosing this is that the logs are huge and noisy. It is hard to tell what
is normal, what is an error, and what is the actual test that failed which
caused the failover.

> You might take a look at the pacct data in /var/account/ for the time of the
> crash; it should indicate exit status for the dying process as well as what
> other processes were started around the same time.

Process accounting wasn't running, but /var/log/audit/audit.log is, which has the
same info. What dying process are we talking about here? I haven't been able to
identify any processes which died.
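
(For next time, enabling process accounting on CentOS 6 would be something
like the following; lastcomm can then show per-process exit records around
the failure window:)

yum install psacct            # provides accton, lastcomm and sa
chkconfig psacct on
service psacct start          # starts writing /var/account/pacct
# after the next incident:
lastcomm corosync
lastcomm shorewall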

> Yes, you're supposed to switch to cman. Not sure if it's related to your
> problem, tho.

I suspect the cman issue is unrelated so I'm not going to mess with it until I
get the current issue figured out. I've had two more crashes since I started
this thread: one around 3am and one just this afternoon around 1pm. A backup
was running but after the cluster failed over the backup kept running and the
load returned to normal (practically zero).

--
Tracy Reed
Re: FW cluster fails at 4am
On 28 Dec 2013, at 3:34 pm, Tracy Reed <treed@ultraviolet.org> wrote:

> Hello all,
>
> First, thanks in advance for any help anyone may provide. I've been battling
> this problem off and on for months and it is driving me mad:
>
> Once every week or two my cluster fails. For reasons unknown it seems to
> initiate a failover and then the shorewall service (lsb) does not get started
> (or is stopped). The majority of the time it happens just after 4am, although
> it has happened at other times, much less frequently. Tonight I am
> going to have to be up at 4am to poke around on the cluster and observe what is
> happening, if anything.
>
> One theory is some sort of resource starvation such as CPU but I've stress
> tested it and run a backup and a big file copy through the firewall at the same
> time and never get more than 1 core of cpu (almost all due to the backup) out
> of 4 utilized and nothing interesting happening to pacemaker/resources.
>
> My setup is a bit complicated in that I have 63 IPaddr2 resources plus the
> shorewall resource, plus order and colocation rules to make sure it all sticks
> together and the IPs come up before shorewall.
>
> I am running the latest RHEL/CentOS RPMs in CentOS 6.5:
>
> [root@new-fw1 shorewall]# rpm -qa |grep -i corosync
> corosync-1.4.1-17.el6.x86_64
> corosynclib-1.4.1-17.el6.x86_64
> [root@new-fw1 shorewall]# rpm -qa |grep -i pacemaker
> pacemaker-1.1.10-14.el6.x86_64
> pacemaker-libs-1.1.10-14.el6.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6.x86_64
> pacemaker-cli-1.1.10-14.el6.x86_64
>
> I am a little concerned about how pacemaker manages the shorewall resource. It
> usually fails to bring up shorewall after a failover event. Shorewall could
> fail to start if the IP addresses shorewall is expecting to be on the
> interfaces are not there yet. But I have dependencies to prevent this from ever
> happening such as:
>
> order shorewall-after-dmz-gw inf: dmz-gw shorewall
>
> I also wonder if the shorewall init script is properly LSB compatible. It
> wasn't out of the box and I had to make a minor change. But now it does seem to
> be LSB compatible:
>
> [root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
> Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:57:14 PST 2013
>
> Shorewall is running
> State:Started (Fri Dec 27 04:11:14 PST 2013) from /etc/shorewall/
>
> result: 0
> [root@new-fw2 ~]# /etc/init.d/shorewall stop ; echo "result: $?"
> Shutting down shorewall: [ OK ]
> result: 0
> [root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
> Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:57:48 PST 2013
>
> Shorewall is stopped
> State:Stopped (Fri Dec 27 16:57:47 PST 2013)
>
> result: 3
> [root@new-fw2 ~]# /etc/init.d/shorewall start ; echo "result: $?"
> Starting shorewall: Shorewall is already running
> [ OK ]
> result: 0
> [root@new-fw2 ~]# /etc/init.d/shorewall status ; echo "result: $?"
> Shorewall-4.5.0.1 Status at new-fw2.mydomain.com - Fri Dec 27 16:58:04 PST 2013
>
> Shorewall is running
> State:Started (Fri Dec 27 16:57:53 PST 2013) from /etc/shorewall/
>
> result: 0
>
> So it shouldn't be an LSB issue at this point...
>
> I have a very hard time making heads or tails of the
> /var/log/cluster/corosync.log log files. For example, I just had this appear in the log files:
>
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave spider2-eth0-40 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave spider2-eth0-41 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave corpsites (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave dbrw (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave mjhdev (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave datapass1-ssl-eth0-2 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave datapass2-ssl-eth0-2 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave datapass2-ssl-eth0-1 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave datapass2-ssl-eth0 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave rrdev2 (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: info: LogActions: Leave webmail-resumepromotion (Started new-fw2.mydomain.com)
> Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: info: do_te_invoke: Processing graph 599 (ref=pe_calc-dc-1388202991-1545) derived from /var/lib/pacemaker/pengine/pe-input-894.bz2
> Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: notice: run_graph: Transition 599 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-894.bz2): Complete
> Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> Dec 27 19:56:31 [1553] new-fw1.mydomain.com crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Dec 27 19:56:31 [1551] new-fw1.mydomain.com pengine: notice: process_pe_message: Calculated Transition 599: /var/lib/pacemaker/pengine/pe-input-894.bz2
>
> Is this normal? What does "Leave" mean?

It means the resource will be left in its current state (running) on new-fw2.mydomain.com.

> This is on the inactive node. The
> active node doesn't seem to be making any log entries at all.
>
> My corosync.conf and full crm config are as follows. There has also been
> concern expressed about the size of our crm config. Is this unreasonable?

It's certainly large and will definitely benefit from the CIB performance improvements coming in 1.1.12, but it shouldn't be a deal breaker.
Consider, though, what effect 63 IPaddr monitor operations running at the same time might have on your system.
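
One mitigation, purely as an example rather than a recommendation: give the
operations more headroom than the 20s timeout they are currently hitting, and
stagger the intervals slightly so all 63 monitors don't fire in lock-step. In
crm syntax that could look like:

op_defaults timeout="60s"
primitive spider2-eth0-41 ocf:heartbeat:IPaddr2 \
    params ip="216.240.176.107" nic="eth2" cidr_netmask="32" \
    op monitor interval="31s" timeout="60s"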

> I am
> going to attempt to attach the 120k bzipped logfile to this email but odds are
> the list won't take attachments. If it doesn't make it and anyone wants to see
> it I will figure out somewhere else to stash it.
>
> The logs are so voluminous that it is hard to know which parts are necessary.

I'd start here:

grep crmd: /Users/beekhof/Downloads/fw1.corosync.log | grep process_lrm_event | grep error | wc -l
45

So it seems that starting at Dec 27 04:02:25, 45 IP address resources failed.

Dec 27 04:02:25 [1553] new-fw1.edirectpublishing.com crmd: error: process_lrm_event: LRM operation spider2-eth0-41_monitor_30000 (2600) Timed Out (timeout=20000ms)
Dec 27 04:02:25 [1553] new-fw1.edirectpublishing.com crmd: error: process_lrm_event: LRM operation spider2-eth0-33_monitor_30000 (2564) Timed Out (timeout=20000ms)

The really bad part is that stop operations also failed:

Dec 27 04:02:53 [1553] new-fw1.edirectpublishing.com crmd: notice: process_lrm_event: LRM operation actonomy_stop_0 (call=3080, rc=1, cib-update=2001, confirmed=true) unknown error
Dec 27 04:02:53 [1553] new-fw1.edirectpublishing.com crmd: notice: process_lrm_event: LRM operation webmail_stop_0 (call=3062, rc=1, cib-update=2003, confirmed=true) unknown error

Which, when fencing is disabled, blocks further recovery of the cluster.
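
Two things follow from that, sketched here rather than prescribed: give the
stop and monitor operations enough timeout that a busy box doesn't generate
spurious failures, and configure real fencing so a genuinely failed stop gets
the node fenced instead of freezing recovery. The property side is just the
line below; it only makes sense once an actual stonith device suited to your
hardware (IPMI, a switched PDU, etc.) has been configured as a resource:

property stonith-enabled="true"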

>
> corosync.conf:
>
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> totem {
> version: 2
> secauth: off
> threads: 0
> interface {
> ringnumber: 0
> bindnetaddr: 10.0.2.0
> mcastaddr: 226.94.1.1
> mcastport: 5000
> member {
> memberaddr: 10.0.2.83
> }
> member {
> memberaddr: 10.0.2.84
> }
> }
> transport udpu
> }
>
> logging {
> fileline: off
> to_stderr: no
> to_logfile: yes
> to_syslog: yes
> logfile: /var/log/cluster/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
>
> amf {
> mode: disabled
> }
>
>
> crm configure show:
>
> node new-fw1.mydomain.com \
> attributes standby="off"
> node new-fw2.mydomain.com \
> attributes standby="off"
> primitive actonomy ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.165" nic="eth1" cidr_netmask="25" \
> op monitor interval="30s" \
> meta is-managed="true"
> primitive corpsites ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.150" nic="eth1" cidr_netmask="25" \
> op monitor interval="30s" \
> meta target-role="Started"
> primitive datapass1-ssl-eth0-2 ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.166" nic="eth1" cidr_netmask="25" \
> op monitor interval="30s"
> primitive datapass2-ssl-eth0 ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.161" nic="eth1" cidr_netmask="25" \
> op monitor interval="30s"
> primitive datapass2-ssl-eth0-1 ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.162" nic="eth1" cidr_netmask="25" \
> op monitor interval="30s"
> primitive datapass2-ssl-eth0-2 ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.167" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> primitive dbrw ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.250" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s" \
> meta target-role="Started"
> primitive dmz-gw ocf:heartbeat:IPaddr2 \
> params ip="10.0.2.254" cidr_netmask="32" \
> op monitor interval="30s"
> primitive mjhdev ocf:heartbeat:IPaddr2 \
> params ip="209.216.236.138" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> primitive mjhwiki ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.163" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> primitive mx1 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.65" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive pin1 ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.147" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> primitive pin2 ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.146" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> primitive pin3 ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.145" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> primitive rabbit ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.149" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> primitive reza ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.66" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive rrdev ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.164" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> primitive rrdev2 ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.240" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> primitive shorewall lsb:shorewall \
> op monitor interval="60s" \
> meta target-role="Started" is-managed="true"
> primitive spider2-eth0-1 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.67" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-10 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.76" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-11 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.77" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-12 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.78" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-13 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.79" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-14 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.80" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-15 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.81" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-16 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.82" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-17 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.83" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-18 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.84" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-19 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.85" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-2 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.68" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-20 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.86" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-21 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.87" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-22 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.88" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-23 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.89" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-24 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.90" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-25 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.91" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-26 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.92" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-27 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.93" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-28 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.94" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-29 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.95" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-3 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.69" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-30 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.96" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-31 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.97" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-32 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.98" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-33 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.99" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-34 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.100" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-35 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.101" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-36 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.102" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-37 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.103" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-38 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.104" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-39 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.105" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-4 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.70" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-40 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.106" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-41 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.107" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-5 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.71" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-6 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.72" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-7 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.73" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-8 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.74" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive spider2-eth0-9 ocf:heartbeat:IPaddr2 \
> params ip="216.240.176.75" nic="eth2" cidr_netmask="32" \
> op monitor interval="30s"
> primitive sugar ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.130" nic="eth1" cidr_netmask="25" \
> op monitor interval="30s"
> primitive webmail ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.254" nic="eth1" cidr_netmask="25" \
> op monitor interval="30s"
> primitive webmail-resumepromotion ocf:heartbeat:IPaddr2 \
> params ip="206.71.189.242" nic="eth1" cidr_netmask="25" \
> op monitor interval="30s"
> primitive wiki ocf:heartbeat:IPaddr2 \
> params ip="209.216.236.135" nic="eth1" cidr_netmask="32" \
> op monitor interval="30s"
> colocation dmz-gw-with-actonomy inf: dmz-gw actonomy
> colocation dmz-gw-with-mjhwiki inf: dmz-gw mjhwiki
> colocation dmz-gw-with-mx1 inf: dmz-gw mx1
> colocation dmz-gw-with-pin1 inf: dmz-gw pin1
> colocation dmz-gw-with-pin2 inf: dmz-gw pin2
> colocation dmz-gw-with-pin3 inf: dmz-gw pin3
> colocation dmz-gw-with-rabbit inf: dmz-gw rabbit
> colocation dmz-gw-with-reza inf: dmz-gw reza
> colocation dmz-gw-with-rrdev inf: dmz-gw rrdev
> colocation dmz-gw-with-spider2-eth0-1 inf: dmz-gw spider2-eth0-1
> colocation dmz-gw-with-spider2-eth0-10 inf: dmz-gw spider2-eth0-10
> colocation dmz-gw-with-spider2-eth0-11 inf: dmz-gw spider2-eth0-11
> colocation dmz-gw-with-spider2-eth0-12 inf: dmz-gw spider2-eth0-12
> colocation dmz-gw-with-spider2-eth0-13 inf: dmz-gw spider2-eth0-13
> colocation dmz-gw-with-spider2-eth0-14 inf: dmz-gw spider2-eth0-14
> colocation dmz-gw-with-spider2-eth0-15 inf: dmz-gw spider2-eth0-15
> colocation dmz-gw-with-spider2-eth0-16 inf: dmz-gw spider2-eth0-16
> colocation dmz-gw-with-spider2-eth0-17 inf: dmz-gw spider2-eth0-17
> colocation dmz-gw-with-spider2-eth0-18 inf: dmz-gw spider2-eth0-18
> colocation dmz-gw-with-spider2-eth0-19 inf: dmz-gw spider2-eth0-19
> colocation dmz-gw-with-spider2-eth0-2 inf: dmz-gw spider2-eth0-2
> colocation dmz-gw-with-spider2-eth0-20 inf: dmz-gw spider2-eth0-20
> colocation dmz-gw-with-spider2-eth0-21 inf: dmz-gw spider2-eth0-21
> colocation dmz-gw-with-spider2-eth0-22 inf: dmz-gw spider2-eth0-22
> colocation dmz-gw-with-spider2-eth0-23 inf: dmz-gw spider2-eth0-23
> colocation dmz-gw-with-spider2-eth0-24 inf: dmz-gw spider2-eth0-24
> colocation dmz-gw-with-spider2-eth0-25 inf: dmz-gw spider2-eth0-25
> colocation dmz-gw-with-spider2-eth0-26 inf: dmz-gw spider2-eth0-26
> colocation dmz-gw-with-spider2-eth0-27 inf: dmz-gw spider2-eth0-27
> colocation dmz-gw-with-spider2-eth0-28 inf: dmz-gw spider2-eth0-28
> colocation dmz-gw-with-spider2-eth0-29 inf: dmz-gw spider2-eth0-29
> colocation dmz-gw-with-spider2-eth0-3 inf: dmz-gw spider2-eth0-3
> colocation dmz-gw-with-spider2-eth0-30 inf: dmz-gw spider2-eth0-30
> colocation dmz-gw-with-spider2-eth0-31 inf: dmz-gw spider2-eth0-31
> colocation dmz-gw-with-spider2-eth0-32 inf: dmz-gw spider2-eth0-32
> colocation dmz-gw-with-spider2-eth0-33 inf: dmz-gw spider2-eth0-33
> colocation dmz-gw-with-spider2-eth0-34 inf: dmz-gw spider2-eth0-34
> colocation dmz-gw-with-spider2-eth0-35 inf: dmz-gw spider2-eth0-35
> colocation dmz-gw-with-spider2-eth0-36 inf: dmz-gw spider2-eth0-36
> colocation dmz-gw-with-spider2-eth0-37 inf: dmz-gw spider2-eth0-37
> colocation dmz-gw-with-spider2-eth0-38 inf: dmz-gw spider2-eth0-38
> colocation dmz-gw-with-spider2-eth0-39 inf: dmz-gw spider2-eth0-39
> colocation dmz-gw-with-spider2-eth0-4 inf: dmz-gw spider2-eth0-4
> colocation dmz-gw-with-spider2-eth0-40 inf: dmz-gw spider2-eth0-40
> colocation dmz-gw-with-spider2-eth0-41 inf: dmz-gw spider2-eth0-41
> colocation dmz-gw-with-spider2-eth0-5 inf: dmz-gw spider2-eth0-5
> colocation dmz-gw-with-spider2-eth0-6 inf: dmz-gw spider2-eth0-6
> colocation dmz-gw-with-spider2-eth0-7 inf: dmz-gw spider2-eth0-7
> colocation dmz-gw-with-spider2-eth0-8 inf: dmz-gw spider2-eth0-8
> colocation dmz-gw-with-spider2-eth0-9 inf: dmz-gw spider2-eth0-9
> colocation dmz-gw-with-sugar inf: dmz-gw sugar
> colocation dmz-gw-with-webmail inf: dmz-gw webmail
> colocation dmz-gw-with-wiki inf: dmz-gw wiki
> colocation shorewall-with-actonomy inf: shorewall actonomy
> colocation shorewall-with-corpsites inf: shorewall corpsites
> colocation shorewall-with-datapass1-ssl-eth0-2 inf: shorewall datapass1-ssl-eth0-2
> colocation shorewall-with-datapass2-ssl-eth0 inf: shorewall datapass2-ssl-eth0
> colocation shorewall-with-datapass2-ssl-eth0-1 inf: shorewall datapass2-ssl-eth0-1
> colocation shorewall-with-datapass2-ssl-eth0-2 inf: shorewall datapass2-ssl-eth0-2
> colocation shorewall-with-dbrw inf: shorewall dbrw
> colocation shorewall-with-dmz-gw inf: shorewall dmz-gw
> colocation shorewall-with-mjhdev inf: shorewall mjhdev
> colocation shorewall-with-mjhwiki inf: shorewall mjhwiki
> colocation shorewall-with-mx1 inf: shorewall mx1
> colocation shorewall-with-pin1 inf: shorewall pin1
> colocation shorewall-with-pin2 inf: shorewall pin2
> colocation shorewall-with-pin3 inf: shorewall pin3
> colocation shorewall-with-rabbit inf: shorewall rabbit
> colocation shorewall-with-reza inf: shorewall reza
> colocation shorewall-with-rrdev inf: shorewall rrdev
> colocation shorewall-with-rrdev2 inf: shorewall rrdev2
> colocation shorewall-with-spider2-eth0-1 inf: shorewall spider2-eth0-1
> colocation shorewall-with-spider2-eth0-10 inf: shorewall spider2-eth0-10
> colocation shorewall-with-spider2-eth0-11 inf: shorewall spider2-eth0-11
> colocation shorewall-with-spider2-eth0-12 inf: shorewall spider2-eth0-12
> colocation shorewall-with-spider2-eth0-13 inf: shorewall spider2-eth0-13
> colocation shorewall-with-spider2-eth0-14 inf: shorewall spider2-eth0-14
> colocation shorewall-with-spider2-eth0-15 inf: shorewall spider2-eth0-15
> colocation shorewall-with-spider2-eth0-16 inf: shorewall spider2-eth0-16
> colocation shorewall-with-spider2-eth0-17 inf: shorewall spider2-eth0-17
> colocation shorewall-with-spider2-eth0-18 inf: shorewall spider2-eth0-18
> colocation shorewall-with-spider2-eth0-19 inf: shorewall spider2-eth0-19
> colocation shorewall-with-spider2-eth0-2 inf: shorewall spider2-eth0-2
> colocation shorewall-with-spider2-eth0-20 inf: shorewall spider2-eth0-20
> colocation shorewall-with-spider2-eth0-21 inf: shorewall spider2-eth0-21
> colocation shorewall-with-spider2-eth0-22 inf: shorewall spider2-eth0-22
> colocation shorewall-with-spider2-eth0-23 inf: shorewall spider2-eth0-23
> colocation shorewall-with-spider2-eth0-24 inf: shorewall spider2-eth0-24
> colocation shorewall-with-spider2-eth0-25 inf: shorewall spider2-eth0-25
> colocation shorewall-with-spider2-eth0-26 inf: shorewall spider2-eth0-26
> colocation shorewall-with-spider2-eth0-27 inf: shorewall spider2-eth0-27
> colocation shorewall-with-spider2-eth0-28 inf: shorewall spider2-eth0-28
> colocation shorewall-with-spider2-eth0-29 inf: shorewall spider2-eth0-29
> colocation shorewall-with-spider2-eth0-3 inf: shorewall spider2-eth0-3
> colocation shorewall-with-spider2-eth0-30 inf: shorewall spider2-eth0-30
> colocation shorewall-with-spider2-eth0-31 inf: shorewall spider2-eth0-31
> colocation shorewall-with-spider2-eth0-32 inf: shorewall spider2-eth0-32
> colocation shorewall-with-spider2-eth0-33 inf: shorewall spider2-eth0-33
> colocation shorewall-with-spider2-eth0-34 inf: shorewall spider2-eth0-34
> colocation shorewall-with-spider2-eth0-35 inf: shorewall spider2-eth0-35
> colocation shorewall-with-spider2-eth0-36 inf: shorewall spider2-eth0-36
> colocation shorewall-with-spider2-eth0-37 inf: shorewall spider2-eth0-37
> colocation shorewall-with-spider2-eth0-38 inf: shorewall spider2-eth0-38
> colocation shorewall-with-spider2-eth0-39 inf: shorewall spider2-eth0-39
> colocation shorewall-with-spider2-eth0-4 inf: shorewall spider2-eth0-4
> colocation shorewall-with-spider2-eth0-40 inf: shorewall spider2-eth0-40
> colocation shorewall-with-spider2-eth0-41 inf: shorewall spider2-eth0-41
> colocation shorewall-with-spider2-eth0-5 inf: shorewall spider2-eth0-5
> colocation shorewall-with-spider2-eth0-6 inf: shorewall spider2-eth0-6
> colocation shorewall-with-spider2-eth0-7 inf: shorewall spider2-eth0-7
> colocation shorewall-with-spider2-eth0-8 inf: shorewall spider2-eth0-8
> colocation shorewall-with-spider2-eth0-9 inf: shorewall spider2-eth0-9
> colocation shorewall-with-sugar inf: shorewall sugar
> colocation shorewall-with-webmail inf: shorewall webmail
> colocation shorewall-with-webmail-resumepromotion inf: shorewall webmail-resumepromotion
> colocation shorewall-with-wiki inf: shorewall wiki
> order shorewall-after-actonomy inf: actonomy shorewall
> order shorewall-after-corpsites inf: corpsites shorewall
> order shorewall-after-datapass1-ssl-eth0-2 inf: datapass1-ssl-eth0-2 shorewall
> order shorewall-after-datapass2-ssl-eth0 inf: datapass2-ssl-eth0 shorewall
> order shorewall-after-datapass2-ssl-eth0-1 inf: datapass2-ssl-eth0-1 shorewall
> order shorewall-after-datapass2-ssl-eth0-2 inf: datapass2-ssl-eth0-2 shorewall
> order shorewall-after-dbrw inf: sugar dbrw
> order shorewall-after-dmz-gw inf: dmz-gw shorewall
> order shorewall-after-mjhdev inf: mjhdev shorewall
> order shorewall-after-mjhwiki inf: mjhwiki shorewall
> order shorewall-after-mx1 inf: mx1 shorewall
> order shorewall-after-pin1 inf: pin1 shorewall
> order shorewall-after-pin2 inf: pin2 shorewall
> order shorewall-after-pin3 inf: pin3 shorewall
> order shorewall-after-rabbit inf: rabbit shorewall
> order shorewall-after-reza inf: reza shorewall
> order shorewall-after-rrdev inf: rrdev shorewall
> order shorewall-after-rrdev2 inf: rrdev2 shorewall
> order shorewall-after-spider2-eth0-1 inf: spider2-eth0-1 shorewall
> order shorewall-after-spider2-eth0-10 inf: spider2-eth0-10 shorewall
> order shorewall-after-spider2-eth0-11 inf: spider2-eth0-11 shorewall
> order shorewall-after-spider2-eth0-12 inf: spider2-eth0-12 shorewall
> order shorewall-after-spider2-eth0-13 inf: spider2-eth0-13 shorewall
> order shorewall-after-spider2-eth0-14 inf: spider2-eth0-14 shorewall
> order shorewall-after-spider2-eth0-15 inf: spider2-eth0-15 shorewall
> order shorewall-after-spider2-eth0-16 inf: spider2-eth0-16 shorewall
> order shorewall-after-spider2-eth0-17 inf: spider2-eth0-17 shorewall
> order shorewall-after-spider2-eth0-18 inf: spider2-eth0-18 shorewall
> order shorewall-after-spider2-eth0-19 inf: spider2-eth0-19 shorewall
> order shorewall-after-spider2-eth0-2 inf: spider2-eth0-2 shorewall
> order shorewall-after-spider2-eth0-20 inf: spider2-eth0-20 shorewall
> order shorewall-after-spider2-eth0-21 inf: spider2-eth0-21 shorewall
> order shorewall-after-spider2-eth0-22 inf: spider2-eth0-22 shorewall
> order shorewall-after-spider2-eth0-23 inf: spider2-eth0-23 shorewall
> order shorewall-after-spider2-eth0-24 inf: spider2-eth0-24 shorewall
> order shorewall-after-spider2-eth0-25 inf: spider2-eth0-25 shorewall
> order shorewall-after-spider2-eth0-26 inf: spider2-eth0-26 shorewall
> order shorewall-after-spider2-eth0-27 inf: spider2-eth0-27 shorewall
> order shorewall-after-spider2-eth0-28 inf: spider2-eth0-28 shorewall
> order shorewall-after-spider2-eth0-29 inf: spider2-eth0-29 shorewall
> order shorewall-after-spider2-eth0-3 inf: spider2-eth0-3 shorewall
> order shorewall-after-spider2-eth0-30 inf: spider2-eth0-30 shorewall
> order shorewall-after-spider2-eth0-31 inf: spider2-eth0-31 shorewall
> order shorewall-after-spider2-eth0-32 inf: spider2-eth0-32 shorewall
> order shorewall-after-spider2-eth0-33 inf: spider2-eth0-33 shorewall
> order shorewall-after-spider2-eth0-34 inf: spider2-eth0-34 shorewall
> order shorewall-after-spider2-eth0-35 inf: spider2-eth0-35 shorewall
> order shorewall-after-spider2-eth0-36 inf: spider2-eth0-36 shorewall
> order shorewall-after-spider2-eth0-37 inf: spider2-eth0-37 shorewall
> order shorewall-after-spider2-eth0-38 inf: spider2-eth0-38 shorewall
> order shorewall-after-spider2-eth0-39 inf: spider2-eth0-39 shorewall
> order shorewall-after-spider2-eth0-4 inf: spider2-eth0-4 shorewall
> order shorewall-after-spider2-eth0-40 inf: spider2-eth0-40 shorewall
> order shorewall-after-spider2-eth0-41 inf: spider2-eth0-41 shorewall
> order shorewall-after-spider2-eth0-5 inf: spider2-eth0-5 shorewall
> order shorewall-after-spider2-eth0-6 inf: spider2-eth0-6 shorewall
> order shorewall-after-spider2-eth0-7 inf: spider2-eth0-7 shorewall
> order shorewall-after-spider2-eth0-8 inf: spider2-eth0-8 shorewall
> order shorewall-after-spider2-eth0-9 inf: spider2-eth0-9 shorewall
> order shorewall-after-sugar inf: sugar shorewall
> order shorewall-after-webmail inf: webmail shorewall
> order shorewall-after-webmail-resumepromotion inf: webmail-resumepromotion shorewall
> order shorewall-after-wiki inf: wiki shorewall
> property $id="cib-bootstrap-options" \
> dc-version="1.1.10-14.el6-368c726" \
> cluster-infrastructure="classic openais (with plugin)" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> last-lrm-refresh="1388199388" \
> no-quorum-policy="ignore" \
> maintenance-mode="false"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="100"
>
> --
> Tracy Reed
> <fw1.corosync.log.bz2>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
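A side note on the configuration quoted above: the dozens of paired colocation
and order constraints all express the same intent ("this address with shorewall,
and before shorewall"). A Pacemaker group can express that intent in a single
object, since group members are implicitly colocated and started in the listed
order. The sketch below is purely illustrative and not a drop-in replacement:
the group name is made up, the member list is abbreviated, and group members
start sequentially rather than in parallel.

# Sketch only: one group in place of the per-IP colocation/order constraints.
# Members are kept on the same node and started in the order listed, so putting
# shorewall last means it starts only after all of the addresses are up.
crm configure group fw-vips-and-shorewall \
    dmz-gw sugar webmail webmail-resumepromotion wiki \
    spider2-eth0-1 spider2-eth0-2 spider2-eth0-3 \
    shorewall
# (the remaining IPaddr2 primitives would be listed here as well)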
Re: FW cluster fails at 4am
On 7 Jan 2014, at 10:52 am, Tracy Reed <treed@ultraviolet.org> wrote:

> On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:
>> Is it possible that it's a coincidence of log rotation after patching? In
>> certain circumstances I've had library replacement or subsequent prelink
>> activity on libraries lead to a crash of some services during log rotation.
>> This hasn't happened to me with pacemaker/cman/corosync, but it might
>> conceivably explain why it only happens to you once in a while.
>
> I just caught the cluster in the middle of crashing again and noticed it had a
> system load of 9, although it isn't clear why.

See my other reply:

"Consider though, what effect 63 IPaddr monitor operations running at the same time might have on your system."

> A backup was running, but after the cluster failed over the backup continued
> and the load went to very nearly zero, so it doesn't seem like the backup was
> causing the issue. But the system's performance was noticeably impacted. I've
> never noticed this situation before.
>
> One thing I really need to learn more about is how the cluster knows when
> something has failed and it needs to fail over.

We ask the resources by calling their script with $action=monitor
For node-level failures, corosync tells us.
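For an lsb: resource such as the shorewall one here, that monitor boils down to
running the init script's status action and mapping the exit code to a state;
roughly (a sketch of the effect, not the exact internal invocation):

# What a monitor of an lsb: resource effectively does (sketch):
/etc/init.d/shorewall status
rc=$?
# LSB status convention: rc 0 = running, rc 3 = stopped.
# Any other exit code is treated as a failure, and a failed monitor is what
# makes pacemaker start recovering the resource (and, with constraints like
# these, potentially everything tied to it).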

> I first set up a linux-ha
> firewall cluster back around 2001 and we used simple heartbeat and some scripts
> to pass around IP addresses and start/stop the firewall. It would ping its
> upstream gateway and communicate with its partner via a serial cable. If the
> active node couldn't ping its upstream it killed the local heartbeat and the
> partner took over. If the active node wasn't sending heartbeats the passive
> node took over. Once working, it stayed working and was much, much simpler than
> the current arrangement.
>
> I have no idea how the current system actually communicates or what the
> criteria for failover really are.
>
> What are the chances that the box gets overloaded and drops a packet and the
> partner takes over?
>
> What if I had an IP conflict with another box on the network and one of my VIP
> IP addresses didn't behave as expected?
>
> What would any of these look like in the logs? One of my biggest difficulties
> in diagnosing this is that the logs are huge and noisy. It is hard to tell what
> is normal, what is an error, and what is the actual test that failed and
> caused the failover.
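One way to cut through the noise might be to let pacemaker summarise failures
itself rather than reading the raw log, and to pre-filter corosync.log for the
higher-severity messages. For example (crm_mon flags as in pacemaker 1.1.x,
offered as a sketch):

# Show current state once, including fail counts and failed actions:
crm_mon -1 -f
# Pull only warnings/errors out of the cluster log around the event:
grep -E 'error|warning' /var/log/cluster/corosync.log | less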
>
>> You might take a look at the pacct data in /var/account/ for the time of the
>> crash; it should indicate exit status for the dying process as well as what
>> other processes were started around the same time.
>
> Process accounting wasn't running, but /var/log/audit/audit.log is, which has
> the same info. What dying process are we talking about here? I haven't been
> able to identify any processes that died.

I think there was an assumption that your resources were long-running daemons.
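If process accounting would still be useful for the next occurrence, it is
quick to enable on CentOS 6; a sketch, assuming the stock psacct package:

# Sketch: turn on process accounting so the next event leaves a record of
# short-lived processes and their exit status (CentOS 6 psacct assumed).
yum install psacct
chkconfig psacct on
service psacct start
# Afterwards, list what ran around the time of the failure:
lastcomm -f /var/account/pacct | less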

>
>> Yes, you're supposed to switch to cman. Not sure if it's related to your
>> problem, tho.
>
> I suspect the cman issue is unrelated

Quite likely.

> so I'm not going to mess with it until I
> get the current issue figured out. I've had two more crashes since I started
> this thread: one around 3am and one just this afternoon around 1pm. A backup
> was running, but after the cluster failed over the backup kept running and the
> load returned to normal (practically zero).
>
> --
> Tracy Reed
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems