Mailing List Archive

A secondary DRBD is not brought online after "crm resource online" in Ubuntu 14.04
Hello.

Ubuntu 14.04, corosync 2.3.3, pacemaker 1.1.10. The cluster consists of 2
nodes (node1 and node2). When I run "crm node standby node2" and then, a
minute later, "crm node online node2", the DRBD secondary on node2 does not
start. The logs say that "drbdadm -c /etc/drbd.conf check-resize vlv" fails
with the error message "No valid meta data found" on the node coming back
online. And, surprisingly, after I run "service drbd start" on node2
manually, everything becomes fine.

Maybe something is broken in /usr/lib/ocf/resource.d/linbit/drbd, and that
is why it cannot start DRBD? Or have I misconfigured something? Could you
please advise what to do?

I have the following configuration (drbd + mount + postgresql, but
postgresql is innocent here, so just ignore it):

*root@node2:/var/log#* crm configure show
node $id="1017525950" node2 attributes standby="off"
node $id="1760315215" node1
primitive drbd ocf:linbit:drbd \
params drbd_resource="vlv" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="120"
primitive fs ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/var/lib/vlv.drbd/root"
options="noatime,nodiratime" fstype="xfs" \
op start interval="0" timeout="300" \
op stop interval="0" timeout="300"
primitive postgresql lsb:postgresql \
op monitor interval="4" timeout="60" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
group pgserver fs postgresql
ms ms_drbd drbd \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
notify="true"
location cli-prefer-pgserver pgserver inf: node1
colocation col_pgserver inf: pgserver ms_drbd:Master
order ord_pgserver inf: ms_drbd:promote pgserver:start
property $id="cib-bootstrap-options" dc-version="1.1.10-42f2063"
cluster-infrastructure="corosync"
stonith-enabled="false" no-quorum-policy="ignore"
last-lrm-refresh="1420304078"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"


The cluster and DRBD statuses on node2 look healthy:


*root@node2:/var/log#* crm status
...
Online: [ node1 node2 ]
Master/Slave Set: ms_drbd [drbd]
    Masters: [ node1 ]
    Slaves: [ node2 ]
Resource Group: pgserver
    fs (ocf::heartbeat:Filesystem): Started node1
    postgresql (lsb:postgresql): Started node1

*root@node2:/var/log#* cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC
0: cs:Connected ro:Secondary/Primary ds:Diskless/UpToDate C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0


Now I switch node2 to standby and verify that DRBD on it has really shut
down:


*root@node1:/etc/rc2.d#* crm node standby node2
*root@node2:/var/log#* cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC
*root@node2:/var/log#*

Then I switch node2 back online and see that DRBD is not initialized and
reattached again!

*root@node2:/var/log#* >syslog
*root@node1:/etc#* crm node online node2
*root@node2:/var/log#* crm status
...
Online: [ node1 node2 ]
Master/Slave Set: ms_drbd [drbd]
    Masters: [ node1 ]
    Stopped: [ node2 ]
Resource Group: pgserver
    fs (ocf::heartbeat:Filesystem): Started node1
    postgresql (lsb:postgresql): Started node1
Failed actions:
    drbd_start_0 (node=node2, call=81, rc=1, status=complete, last-rc-change=Sat Jan 3 12:05:32 2015, queued=1118ms, exec=0ms): unknown error

*root@node2:/var/log#* cat syslog | head -n 30
Jan 3 12:05:31 node2 crmd[918]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Jan 3 12:05:31 node2 cib[913]: notice: cib:diff: Diff: --- 0.29.3
Jan 3 12:05:31 node2 cib[913]: notice: cib:diff: Diff: +++ 0.30.1
027344551b46745123e4a52562e55974
Jan 3 12:05:31 node2 pengine[917]: notice: unpack_config: On loss of CCM
Quorum: Ignore
Jan 3 12:05:31 node2 pengine[917]: notice: LogActions: Start
drbd:1#011(node2)
Jan 3 12:05:31 node2 crmd[918]: notice: te_rsc_command: Initiating
action 46: notify drbd_pre_notify_start_0 on node1
Jan 3 12:05:31 node2 pengine[917]: notice: process_pe_message:
Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-11.bz2
Jan 3 12:05:32 node2 crmd[918]: notice: te_rsc_command: Initiating
action 10: start drbd:1_start_0 on node2 (local)
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf check-resize vlv
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 attach vlv
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 attach vlv
Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished:
drbd_start_0:1931:stderr [ no suitable meta data found :( ]
Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished:
drbd_start_0:1931:stderr [ Command 'drbdmeta 0 v08 /dev/loop0 internal
check-resize' terminated with exit code 255 ]
Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished:
drbd_start_0:1931:stderr [ No valid meta data found ]
Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished:
drbd_start_0:1931:stderr [ Command 'drbdmeta 0 v08 /dev/loop0 internal
apply-al' terminated with exit code 255 ]
Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished:
drbd_start_0:1931:stderr [ No valid meta data found ]
Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished:
drbd_start_0:1931:stderr [ Command 'drbdmeta 0 v08 /dev/loop0 internal
apply-al' terminated with exit code 255 ]
Jan 3 12:05:33 node2 crmd[918]: notice: process_lrm_event: LRM operation
drbd_start_0 (call=81, rc=1, cib-update=79, confirmed=true) unknown error
Jan 3 12:05:33 node2 crmd[918]: notice: process_lrm_event:
node2-drbd_start_0:81 [ \n\n\n\n\n\n\n ]
Jan 3 12:05:33 node2 crmd[918]: warning: status_from_rc: Action 10
(drbd:1_start_0) on node2 failed (target: 0 vs. rc: 1): Error
Jan 3 12:05:33 node2 crmd[918]: warning: update_failcount: Updating
failcount for drbd on node2 after failed start: rc=1 (update=INFINITY,
time=1420304733)
Jan 3 12:05:33 node2 attrd[916]: notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-drbd (INFINITY)
Jan 3 12:05:33 node2 attrd[916]: notice: attrd_perform_update: Sent
update 60: fail-count-drbd=INFINITY
Jan 3 12:05:33 node2 attrd[916]: notice: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-drbd (1420304733)

*root@node2:/var/log#* cat syslog | grep ERROR | head -n 30
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf check-resize vlv
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 attach vlv
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 attach vlv
Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
Jan 3 12:05:33 node2 drbd(drbd)[2021]: ERROR: vlv: Called drbdadm -c
/etc/drbd.conf --peer node1 -v adjust vlv
Jan 3 12:05:33 node2 drbd(drbd)[2021]: ERROR: vlv: Exit code 1
Jan 3 12:05:33 node2 drbd(drbd)[2021]: ERROR: vlv: Command output:
drbdmeta 0 v08 /dev/loop0 internal apply-al


Here's the DRBD config file:


*root@node2:/var/log#* cat /etc/drbd.d/drbd.res
resource vlv {
    device /dev/drbd0;
    disk /dev/loop0;
    meta-disk internal;
    syncer { rate 150M; verify-alg md5; }
    on node1 { address x.x.x.x:7788; }
    on node2 { address y.y.y.y:7788; }
}

*root@node2:/var/log#* losetup -a
/dev/loop0: [fd01]:1314858 (/var/lib/vlv.drbd/vlv.img)
Re: A secondary DRBD is not brought online after "crm resource online" in Ubuntu 14.04
03.01.2015 20:35, Dmitry Koterov wrote:
> [...]

Please see my comment inline, in the /proc/drbd output.

> [...]
>
> *root@node2:/var/log#* cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> srcversion: F97798065516C94BE0F27DC
> 0: cs:Connected ro:Secondary/Primary ds:Diskless/UpToDate C r-----
                                          ^^^^^^^^
that is your problem: node2 is Diskless, i.e. it never attached its local
backing device, so there is no metadata for drbdadm to find.

> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> [...]


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: A secondary DRBD is not brought online after "crm resource online" in Ubuntu 14.04
Thanks.

I had missed

drbdadm -- --force create-md vlv

on node2; that was the cause of the problem. Because of this there was
no drbd-minor-0.lkbd file in /var/lib/drbd on node2, so DRBD did not work
properly (e.g. "drbdadm adjust" failed, which in turn made pacemaker's start
of the resource fail).

BTW, to speed up the initial resync, one could execute:

drbdadm -- --force create-md vlv
drbdadm up vlv
drbdadm -- --clear-bitmap new-current-uuid vlv/0
drbdadm primary vlv
mkfs.xfs -d agcount=8 /dev/drbd0
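
(The "--clear-bitmap new-current-uuid" step marks both nodes as already in
sync, so the full initial synchronization is skipped. This is only safe on a
brand-new, empty device, which is why mkfs follows immediately afterwards.)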


P.S.
I played with strace and the sources a bit and discovered that "service drbd
start" only appeared to work because it executes "drbdadm
adjust-with-progress", whereas pacemaker's resource agent executes "drbdadm -v
adjust". Contrary to what the docs say, adjust and adjust-with-progress
differ in more than just their debug output: there is a hack in the drbdadm
code which forces them to behave differently:

for (stage = CFG_PREREQ; stage < __CFG_LAST; stage++) {
        r = _run_deferred_cmds(stage);
        if (r) {
                if (!adjust_with_progress)
                        return 1; /* FIXME r? */
                ret = 1;
        }
}

This "FIXME" code is executed in "adjust" mode, but NOT when
"adjust-with-progress" is active. So if one uses "adjust-with-progress",
drbdadm silently skips failed steps and continues with the next ones, while
"adjust" fails on a first failed step.



On Sat, Jan 3, 2015 at 10:27 PM, Vladislav Bogdanov <bubble@hoster-ok.com>
wrote:

> 03.01.2015 20:35, Dmitry Koterov wrote:
> [...]
>
>> 0: cs:Connected ro:Secondary/Primary ds:Diskless/UpToDate C r-----
>                                          ^^^^^^^^
> that is your problem
>
> [...]