Mailing List Archive

[PATCH] Filesystem RA: remove a status file only when OCF_CHECK_LEVEL is set as 20
Hi,

This is a small patch for Filesystem RA.

When shared storage is mounted without the OCF_CHECK_LEVEL parameter,
Filesystem_stop() can cause an unexpected timeout.

For example:
(1) mount the shared storage without OCF_CHECK_LEVEL
(2) disconnect Fibre Channels
(3) service heartbeat stop

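When Filesystem_stop() is called, it tries to remove the STATUSFILE on
the shared storage.
STATUSFILE is only created when OCF_CHECK_LEVEL is set to 20, but the
stop action tries to remove it regardless. With the storage disconnected,
the RA cannot access it and the stop operation times out.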
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L756

In the default case (without OCF_CHECK_LEVEL), it's enough to just try to
unmount the file system, isn't it?
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
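
The idea can be sketched roughly as follows. This is only an illustration, not the patch itself: remove_status_file is a hypothetical helper name, and echo stands in for ocf_log so that the snippet runs outside a cluster.

```shell
#!/bin/sh
# Sketch only: remove_status_file is a hypothetical helper, not the agent's code.
# echo stands in for ocf_log so the snippet runs outside a cluster.
remove_status_file() {
    statusfile=$1
    # Only touch the shared storage when OCF_CHECK_LEVEL is 20, because that
    # is the only case in which the status file is created.
    if [ "${OCF_CHECK_LEVEL:-0}" = "20" ] && [ -f "$statusfile" ]; then
        if ! rm -f "$statusfile"; then
            echo "WARN: Failed to remove status file $statusfile." >&2
        fi
    fi
    return 0
}
```

With OCF_CHECK_LEVEL unset, the function returns without touching the file, so the stop action would go straight to the unmount.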

Regards,
Junko IKEDA

NTT DATA INTELLILINK CORPORATION
Re: [PATCH] Filesystem RA: remove a status file only when OCF_CHECK_LEVEL is set as 20
Hi Junko-san,

On Tue, May 08, 2012 at 05:18:36PM +0900, Junko IKEDA wrote:
> In the default case (without OCF_CHECK_LEVEL), it's enough to just try to
> unmount the file system, isn't it?
> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774

I don't see a need to remove the STATUSFILE at all, as that may
(as you observed) prevent the filesystem from stopping.
Perhaps we should skip it altogether? If nobody objects, let's just
remove this code:

 758         if [ -f "$STATUSFILE" ]; then
 759             rm -f ${STATUSFILE}
 760             if [ $? -ne 0 ]; then
 761                 ocf_log warn "Failed to remove status file ${STATUSFILE}."
 762             fi
 763         fi

Cheers,

Dejan

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [PATCH] Filesystem RA: remove a status file only when OCF_CHECK_LEVEL is set as 20
On 2012-05-08T12:08:27, Dejan Muhamedagic <dejan@suse.de> wrote:

> > In the default case (without OCF_CHECK_LEVEL), it's enough to just try to
> > unmount the file system, isn't it?
> > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
>
> I don't see a need to remove the STATUSFILE at all, as that may
> (as you observed) prevent the filesystem from stopping.
> Perhaps we should skip it altogether? If nobody objects, let's just
> remove this code:
>
>  758         if [ -f "$STATUSFILE" ]; then
>  759             rm -f ${STATUSFILE}
>  760             if [ $? -ne 0 ]; then
>  761                 ocf_log warn "Failed to remove status file ${STATUSFILE}."
>  762             fi
>  763         fi

That would mean you can no longer differentiate between a "crash" and a
clean unmount.

A hanging FC/SAN is likely to be unable to flush any other dirty buffers
as well, so the umount may not necessarily succeed w/o errors. I
think it's unreasonable to expect that the node will survive such a
scenario w/o recovery.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: [PATCH] Filesystem RA: remove a status file only when OCF_CHECK_LEVEL is set as 20
Hi Lars,

On Tue, May 08, 2012 at 01:35:16PM +0200, Lars Marowsky-Bree wrote:
> That would mean you can no longer differentiate between a "crash" and a
> clean unmount.

One could take a look at the logs. I guess that a crash would
otherwise be noticeable as well :)

> A hanging FC/SAN is likely to be unable to flush any other dirty buffers
> as well, so the umount may not necessarily succeed w/o errors. I
> think it's unreasonable to expect that the node will survive such a
> scenario w/o recovery.

True. However, in case of network attached storage or other
transient errors it may lead to an unnecessary timeout followed
by fencing, i.e. the chance for a longer failover time is higher.
Just leaving a file around may not justify the risk.

Junko-san, what was your experience?

Cheers,

Dejan

Re: [PATCH] Filesystem RA: remove a status file only when OCF_CHECK_LEVEL is set as 20
Hi,

In my case, the umount succeeded when the Fibre Channel was disconnected,
so it seemed that handling the status file caused the longer failover,
as Dejan said.
If the umount fails, the stop will run into a timeout and might trigger a
STONITH action; that case also makes sense (though I could not observe it).

I tried the following two setups:

(1) timeout : multipath > RA
multipath timeout = 120s
Filesystem RA stop timeout = 60s

(2) timeout : multipath < RA
multipath timeout = 60s
Filesystem RA stop timeout = 120s

In case (1), Filesystem_stop() fails. The hanging FC causes the stop timeout.

In case (2), Filesystem_stop() succeeds.
The filesystem is hanging, but lines 758 and 759 succeed (rc=0).
The status file is no longer accessible, so in fact it remains on the
filesystem.
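
The two observed cases can be restated as a small sketch. It only encodes the results reported above; predict_stop is a hypothetical name, and the case of equal timeouts was not tested, so the sketch does not guess at it.

```shell
#!/bin/sh
# Sketch only: predict_stop is a hypothetical name that restates the two
# observed cases. Arguments: multipath timeout and Filesystem RA stop
# timeout, both in seconds.
predict_stop() {
    multipath_timeout=$1
    stop_timeout=$2
    if [ "$multipath_timeout" -gt "$stop_timeout" ]; then
        # Case (1): the stop operation runs out of time first.
        echo "stop times out"
    elif [ "$multipath_timeout" -lt "$stop_timeout" ]; then
        # Case (2): multipath gives up first, so stop returns in time.
        echo "stop succeeds"
    else
        echo "not tested"
    fi
}

predict_stop 120 60
predict_stop 60 120
```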

> > 758 if [ -f "$STATUSFILE" ]; then
> > 759 rm -f ${STATUSFILE}
> > 760 if [ $? -ne 0 ]; then

So line 761 might not be called as expected.

> > 761 ocf_log warn "Failed to remove status file ${STATUSFILE}."
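
As a general point about rm, and not a diagnosis of what happens on a hanging multipath device: rm -f returns 0 when its target does not exist, so rc=0 alone does not prove that a file was removed. A minimal sketch, using a scratch directory and a made-up file name:

```shell
#!/bin/sh
# Sketch only: uses a scratch directory, not a real cluster filesystem.
scratch=$(mktemp -d)
missing="$scratch/no_such_status_file"

# rm -f does not treat a missing file as an error, so rc is 0 here.
rm -f "$missing"
rc=$?
echo "rm -f rc=$rc"

# A check in the style of line 760 therefore never reaches the warning.
if [ "$rc" -ne 0 ]; then
    echo "WARN: Failed to remove status file $missing."
fi
rmdir "$scratch"
```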


By the way, my concern is the unexpected stop timeout and the longer
failover time.
If OCF_CHECK_LEVEL is set to 20, it would be better to try to remove the
status file just in case.
That can handle case (2) if the user wants to recover from it with STONITH.


Thanks,
Junko

2012/5/8 Dejan Muhamedagic <dejan@suse.de>:
> Junko-san, what was your experience?
Re: [PATCH] Filesystem RA: remove a status file only when OCF_CHECK_LEVEL is set as 20
Hi,

Is my case hard to understand?
"multipath" refers to the Fibre Channel connections; there are two cables
for redundancy.

Thanks,
Junko

2012/5/9 Junko IKEDA <tsukishima.ha@gmail.com>:
> Hi,
>
> In my case, the umount succeeded when the Fibre Channel was disconnected,
> so it seemed that handling the status file caused the longer failover,
> as Dejan said.
> If the umount fails, the stop will run into a timeout and might trigger a
> STONITH action; that case also makes sense (though I could not observe it).
>
> I tried the following two setups:
>
> (1) timeout : multipath > RA
> multipath timeout = 120s
> Filesystem RA stop timeout = 60s
>
> (2) timeout : multipath < RA
> multipath timeout = 60s
> Filesystem RA stop timeout = 120s
>
> In case (1), Filesystem_stop() fails. The hanging FC causes the stop timeout.
>
> In case (2), Filesystem_stop() succeeds.
> The filesystem is hanging, but lines 758 and 759 succeed (rc=0).
> The status file is no longer accessible, so in fact it remains on the
> filesystem.