Mailing List Archive

R: [PATCH] Filesystem RA:
Hi everybody,
In my case (very similar to Junko's), when I disconnect the Fibre Channel
links the "try_umount" procedure in the Filesystem RA script doesn't work.

After the configured retries the active/passive cluster doesn't fail over,
and the lvmdir resource is flagged as "failed" rather than "stopped".

I must say that even if I try to umount the /storage filesystem manually it
doesn't work, because Sybase is using some files stored on it (device busy);
this is why the RA cannot complete the operation cleanly. Is there a way to
force the failover anyway?

Some notes on what I have already tried:
1) The same test with a different optical SAN/storage in the past; the RA
was always able to umount the storage correctly;
2) I modified the RA to force "umount -l" even though I have an ext4
filesystem rather than NFS;
3) I killed the hung processes with "fuser -km /storage", but the umount
still failed, and after a while I got a kernel panic.

Is there a way to force the failover anyway, even if the umount is not clean?
Any suggestions?

Thanks for your time,
Regards
Guglielmo

P.S. lvmdir resource configuration

<primitive class="ocf" id="resource_lvmdir" provider="heartbeat"
           type="Filesystem">
  <instance_attributes id="resource_lvmdir-instance_attributes">
    <nvpair id="resource_lvmdir-instance_attributes-device"
            name="device" value="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM"/>
    <nvpair id="resource_lvmdir-instance_attributes-directory"
            name="directory" value="/storage"/>
    <nvpair id="resource_lvmdir-instance_attributes-fstype"
            name="fstype" value="ext4"/>
  </instance_attributes>
  <meta_attributes id="resource_lvmdir-meta_attributes">
    <nvpair id="resource_lvmdir-meta_attributes-multiple-active"
            name="multiple-active" value="stop_start"/>
    <nvpair id="resource_lvmdir-meta_attributes-migration-threshold"
            name="migration-threshold" value="1"/>
    <nvpair id="resource_lvmdir-meta_attributes-failure-timeout"
            name="failure-timeout" value="0"/>
  </meta_attributes>
  <operations>
    <op enabled="true" id="resource_lvmdir-startup" interval="60s"
        name="monitor" on-fail="restart" requires="nothing" timeout="40s"/>
    <op id="resource_lvmdir-start-0" interval="0" name="start"
        on-fail="restart" requires="nothing" timeout="180s"/>
    <op id="resource_lvmdir-stop-0" interval="0" name="stop"
        on-fail="restart" requires="nothing" timeout="180s"/>
  </operations>
</primitive>
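
A related question: should I also set the Filesystem RA's "force_unmount"
parameter, if my resource-agents version supports it? As far as I understand
it makes the agent kill processes holding the mountpoint (much like a manual
"fuser -km") before trying to umount. A purely hypothetical sketch of what I
mean, as an extra nvpair in the instance_attributes above (please check your
agent's metadata for whether the parameter exists and which values it accepts):

    <!-- hypothetical addition to the instance_attributes above -->
    <nvpair id="resource_lvmdir-instance_attributes-force_unmount"
            name="force_unmount" value="true"/>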

2012/5/9 Junko IKEDA <tsukishima.ha@gmail.com>:
> Hi,
>
> In my case, the umount succeeds when the Fibre Channel is
> disconnected, so it seemed that handling the status file caused the
> longer failover, as Dejan said.
> If the umount fails, it will run into a timeout and might trigger a
> stonith action; that case also makes sense (though I couldn't see it here).
>
> I tried the following two setups:
>
> (1) timeout : multipath > RA
> multipath timeout = 120s
> Filesystem RA stop timeout = 60s
>
> (2) timeout : multipath < RA
> multipath timeout = 60s
> Filesystem RA stop timeout = 120s
>
> In case (1), Filesystem_stop() fails: the hanging FC causes the stop timeout.
>
> In case (2), Filesystem_stop() succeeds.
> The filesystem is hanging, but lines 758 and 759 succeed (rc=0).
> The status file is not accessible any more, so in fact it remains on
> the filesystem.
>
>> > 758 if [ -f "$STATUSFILE" ]; then
>> > 759 rm -f ${STATUSFILE}
>> > 760 if [ $? -ne 0 ]; then
>
> So, line 761 might not be reached as expected.
>
>> > 761 ocf_log warn "Failed to remove status file ${STATUSFILE}."
>
>
> By the way, my concern is the unexpected stop timeout and the longer
> failover time. If OCF_CHECK_LEVEL is set to 20, it would be better to
> try to remove the status file just in case. It can handle case (2) if
> the user wants to recover this case with STONITH.
>
>
> Thanks,
> Junko
>
> 2012/5/8 Dejan Muhamedagic <dejan@suse.de>:
>> Hi Lars,
>>
>> On Tue, May 08, 2012 at 01:35:16PM +0200, Lars Marowsky-Bree wrote:
>>> On 2012-05-08T12:08:27, Dejan Muhamedagic <dejan@suse.de> wrote:
>>>
>>> > > In the default (without OCF_CHECK_LEVEL), it's enough to try to
>>> > > unmount the file system, isn't it?
>>> > > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
>>> >
>>> > I don't see a need to remove the STATUSFILE at all, as that may
>>> > (as you observed) prevent the filesystem from stopping.
>>> > Perhaps we should skip it altogether? If nobody objects, let's just
>>> > remove this code:
>>> >
>>> >  758         if [ -f "$STATUSFILE" ]; then
>>> >  759             rm -f ${STATUSFILE}
>>> >  760             if [ $? -ne 0 ]; then
>>> >  761                 ocf_log warn "Failed to remove status file ${STATUSFILE}."
>>> >  762             fi
>>> >  763         fi
>>>
>>> That would mean you can no longer differentiate between a "crash"
>>> and a clean unmount.
>>
>> One could take a look at the logs. I guess that a crash would
>> otherwise be noticeable as well :)
>>
>>> A hanging FC/SAN is likely to be unable to flush any other dirty
>>> buffers as well, so the umount may not necessarily succeed w/o
>>> errors. I think it's unreasonable to expect that the node will
>>> survive such a scenario w/o recovery.
>>
>> True. However, in the case of network-attached storage or other transient
>> errors it may lead to an unnecessary timeout followed by fencing,
>> i.e. the chance of a longer failover time is higher.
>> Just leaving a file around may not justify that risk.
>>
>> Junko-san, what was your experience?
>>
>> Cheers,
>>
>> Dejan
>>
>>> Regards,
>>>     Lars
>>>
>>> --
>>> Architect Storage/HA
>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
>>> Imendörffer, HRB 21284 (AG Nürnberg) "Experience is the name
>>> everyone gives to their mistakes." -- Oscar Wilde
>>>
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Re: R: [PATCH] Filesystem RA:
Hi

The correct way for that to be handled, given your additional detail, would
have been for the node to receive a STONITH.

Things that you should check:
1) The STONITH device is configured correctly and is operational.
2) The "on-fail" for the stop operation of any filesystem cluster resource
should be "fence".
3) Review your constraints so that the order and relationship between the
Sybase and filesystem resources are correct and Sybase is stopped first (a
rough sketch of all three points is below).
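
As a rough sketch only (your resource and STONITH device names will differ;
I'm assuming a Sybase primitive called "resource_sybase"), points 1-3 could
look something like this in the CIB, in the same XML form as the
configuration you posted:

    <!-- 1) cluster-wide fencing must be on, plus an actual STONITH primitive
            for your hardware (not shown here) -->
    <nvpair id="cib-bootstrap-options-stonith-enabled"
            name="stonith-enabled" value="true"/>

    <!-- 2) a failed stop of the filesystem escalates to fencing instead of
            being retried -->
    <op id="resource_lvmdir-stop-0" interval="0" name="stop"
        on-fail="fence" timeout="180s"/>

    <!-- 3) a mandatory order "filesystem first, then Sybase" is reversed on
            stop, so Sybase is stopped before the umount is attempted; the
            colocation keeps both on the same node -->
    <rsc_order id="order_storage_before_sybase" score="INFINITY"
               first="resource_lvmdir" then="resource_sybase"/>
    <rsc_colocation id="colo_sybase_with_storage" score="INFINITY"
                    rsc="resource_sybase" with-rsc="resource_lvmdir"/>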

Hope this helps

Darren


Sent from my iPhone

On 09/04/2013, at 11:57 PM, "Guglielmo Abbruzzese" <g.abbruzzese@resi.it> wrote:

> Hi everybody,
> In my case (very similar to Junko's), when I disconnect the Fibre Channel
> links the "try_umount" procedure in the Filesystem RA script doesn't work.
> [...]
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/