Mailing List Archive: Problem in SLES11 SP2 (actions on removed resources)?

Hi!

I have some strange problems with the current update of the cluster software in SLES11 SP2 (I didn't see such problems before the update):

sbd monitoring went crazy (reporting running sbds when there were none, complaining about being unable to stop sbd when there was none), so I stopped it.

Now that I re-activated it, the cluster talks about resources that had been deleted days ago, like:
---
Apr 19 08:56:19 h05 attrd: [13083]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 19 08:56:19 h05 attrd: [13083]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-prm_stonith_sbd (1365148953)
Apr 19 08:56:19 h05 cib: [13080]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='h05']/lrm (origin=local/crmd/6835, version=0.744.19): ok (rc=0)
Apr 19 08:56:19 h05 crmd: [13085]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=prm_v06_v06_raid1_last_0, magic=0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d, cib=0.744.19) : Resource op removal
---

The resource prm_v06_v06_raid1 had been removed several days before in:
Apr 15 10:08:16 h05 cib: [13080]: info: cib_replace_notify: Replaced: 0.733.19 -> 0.734.1 from <null>

Interestingly, a CIB dump taken minutes before the SBD change showed that the deleted resource still had an "lrm_resource" entry in the CIB:
---
<lrm_resource id="prm_v06_v06_raid1" type="Raid1" class="ocf" provider="heartbeat">
<lrm_rsc_op id="prm_v06_v06_raid1_last_0" operation_key="prm_v06_v06_raid1_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.6" transition-key="117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d" transition-magic="0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d" call-id="76" rc-code="7" op-status="0" interval="0" op-digest="0e6b2558abfd3cee98ee60cb7b03e6b0"/>
---
And the resource should have been removed before:
Apr 15 13:14:00 h05 crmd: [13085]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=prm_v06_v06_raid1_last_0, magic=0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d, cib=0.735.35) : Resource op removal
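
For reading the "magic" values in the log lines above, a small parser can help. The following is only a sketch: the CIB dump shows that the part before the semicolon matches op-status and rc-code, and that the part after it is the transition-key. The field names given to the three numbers inside the key (action id, transition id, expected rc) are an assumption about how Pacemaker builds the key, and parse_transition_magic is a hypothetical helper, not part of any cluster tool.

```python
from typing import NamedTuple


class TransitionMagic(NamedTuple):
    op_status: int      # matches op-status in the lrm_rsc_op entry
    rc_code: int        # matches rc-code in the lrm_rsc_op entry
    action_id: int      # assumed meaning of the first key field
    transition_id: int  # assumed meaning of the second key field
    target_rc: int      # assumed meaning of the third key field
    uuid: str           # trailing UUID of the transition-key


def parse_transition_magic(magic: str) -> TransitionMagic:
    """Split 'op_status:rc;a:b:c:uuid' into its components."""
    status_part, key = magic.split(";", 1)
    op_status, rc_code = (int(x) for x in status_part.split(":"))
    action_id, transition_id, target_rc, uuid = key.split(":", 3)
    return TransitionMagic(op_status, rc_code, int(action_id),
                           int(transition_id), int(target_rc), uuid)


if __name__ == "__main__":
    # Value copied from the log excerpt in this message.
    print(parse_transition_magic(
        "0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d"))
```

Read this way, the entry for prm_v06_v06_raid1 is the result of a probe (monitor with interval 0) that returned 7, which in OCF terms means "not running".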

Isn't this very strange, or is there a reasonable explanation?

Regards,
Ulrich

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

On 2013-04-19T09:56:37, Ulrich Windl <Ulrich.Windl@rz.uni-regensburg.de> wrote:

> sbd monitoring went crazy (reporting running sbds when there were none, complaining about being unable to stop sbd when there was none), so I stopped it.

What did you monitor? And what do you mean by "went crazy"?

(Besides, monitoring sbd is unnecessary anyway.)

> Now that I re-activated it, the cluster talks about resources that had been deleted days ago, like:

Hm, is this creating an actual problem? The status section may have
records about orphan resources, but that should be harmless. (I think a
recent change made this better, too.)
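
To check whether a CIB dump (for example the output of cibadmin -Q) still carries status records for resources that are no longer configured, something like the following could be used. It is a minimal sketch, not an official tool: find_orphan_lrm_resources and SAMPLE_CIB are hypothetical names, the sample XML is trimmed down for illustration, and the ":N" suffix handling for clone instances is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed CIB used only for illustration.
SAMPLE_CIB = """
<cib>
  <configuration>
    <resources>
      <primitive id="prm_stonith_sbd" class="stonith" type="external/sbd"/>
    </resources>
  </configuration>
  <status>
    <node_state uname="h05">
      <lrm>
        <lrm_resources>
          <lrm_resource id="prm_stonith_sbd" type="external/sbd" class="stonith"/>
          <lrm_resource id="prm_v06_v06_raid1" type="Raid1" class="ocf" provider="heartbeat"/>
        </lrm_resources>
      </lrm>
    </node_state>
  </status>
</cib>
"""


def find_orphan_lrm_resources(cib_xml):
    """Map node name -> lrm_resource ids that have no configured primitive."""
    root = ET.fromstring(cib_xml)
    # Only the configuration section contains <primitive> elements.
    configured = {p.get("id") for p in root.iter("primitive")}
    orphans = {}
    for node in root.iter("node_state"):
        for rsc in node.iter("lrm_resource"):
            # Assumption: clone instances may carry a ":N" suffix,
            # so compare only the base name.
            base_id = rsc.get("id", "").split(":", 1)[0]
            if base_id not in configured:
                orphans.setdefault(node.get("uname"), []).append(rsc.get("id"))
    return orphans


if __name__ == "__main__":
    print(find_orphan_lrm_resources(SAMPLE_CIB))
```

Any id it reports is only a candidate orphan; whether the record causes trouble is a separate question.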

Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
