Mailing List Archive

Raid1 fails to stop
Hello all,

I got a report recently that a Raid1 resource failed to stop. It
turned out that some web management daemon called amDaemon kept
the MD devices open. After commit
2f8ec082408fb5c825a5fe30ec436c7e5208aa0a (attached), there is
now code which stops such processes.

Do you have objections to this, or do you think it should be done
in a different way? Note that it won't affect things like mounted
filesystems or VGs running on top of an MD device.
And have you ever seen such a process opening the MD device? I'm
a bit worried about killing processes which are not supposed to
be killed.

Cheers,

Dejan
Re: Raid1 fails to stop
On 2012-09-18T18:28:29, Dejan Muhamedagic <dejan@suse.de> wrote:

> I got a report recently that a Raid1 resource failed to stop. It
> turned out that some web management daemon called amDaemon kept
> the MD devices open. After commit
> 2f8ec082408fb5c825a5fe30ec436c7e5208aa0a (attached), there is
> now code which stops such processes.

Similar problem as we have with file systems, yes.

> +get_users_pids() {
> + local mddev=$1
> + local outp l
> + ocf_log debug "running lsof to list $mddev users..."
> + outp=`lsof $mddev | tail -n +2`
> + echo "$outp" | awk '{print $2}' | sort -u
> + echo "$outp" | while read l; do
> + ocf_log warn "$l"
> + done
> +}

Why not use "fuser"?

(And I think the ocf_log warn should be dropped before shipping.)

The code to do an escalating kill perhaps could be combined with the
Filesystem script.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: Raid1 fails to stop
On Tue, Sep 18, 2012 at 07:27:13PM +0200, Lars Marowsky-Bree wrote:
> On 2012-09-18T18:28:29, Dejan Muhamedagic <dejan@suse.de> wrote:
>
> > I got a report recently that a Raid1 resource failed to stop. It
> > turned out that some web management daemon called amDaemon kept
> > the MD devices open. After commit
> > 2f8ec082408fb5c825a5fe30ec436c7e5208aa0a (attached), there is
> > now code which stops such processes.
>
> Similar problem as we have with file systems, yes.
>
> > +get_users_pids() {
> > + local mddev=$1
> > + local outp l
> > + ocf_log debug "running lsof to list $mddev users..."
> > + outp=`lsof $mddev | tail -n +2`
> > + echo "$outp" | awk '{print $2}' | sort -u
> > + echo "$outp" | while read l; do
> > + ocf_log warn "$l"
> > + done
> > +}
>
> Why not use "fuser"?

In my tests, fuser simply didn't show any processes. But I'll
check again.
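
For comparison, a fuser-based variant might look roughly like the
sketch below. This is not code from the commit: the function name
get_users_pids_fuser is hypothetical, and the sketch assumes the
psmisc fuser, which prints bare PIDs on stdout and the file name and
access flags on stderr. Whether fuser reports the holders of an MD
device at all is the open question here, so this only illustrates
the parsing side.

```shell
# Hypothetical fuser-based counterpart to get_users_pids (a sketch,
# not the committed code). Assumes psmisc fuser: PIDs go to stdout,
# the file name and access flags go to stderr.
get_users_pids_fuser() {
    local mddev="$1"
    local pid
    # Word splitting turns fuser's space-separated PID list into one
    # PID per line; sort -un removes duplicates numerically.
    for pid in $(fuser "$mddev" 2>/dev/null); do
        echo "$pid"
    done | sort -un
}
```

Called as "get_users_pids_fuser /dev/md0", it prints one PID per
line, or nothing if fuser finds no users.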

> (And I think the ocf_log warn should be dropped before shipping.)

The warnings are logged because there really should not be any
processes holding the devices, so they could reveal that the
system is misconfigured or that some unforeseen activity is
taking place. I'd recommend that we keep them.

> The code to do an escalating kill perhaps could be combined with the
> Filesystem script.

In the meantime I moved that code to ocf_stop_processes in
ocf-shellfuncs, so, yes, the Filesystem script should use that too.
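
To illustrate what an escalating kill means here, a minimal sketch
follows. It is not the ocf_stop_processes implementation from
ocf-shellfuncs, whose interface may differ; the name
stop_processes_sketch and its arguments (a signal list, a wait time
in seconds per signal, then the PIDs) are assumptions made for the
example.

```shell
# Hypothetical escalating kill (a sketch; the real ocf_stop_processes
# in ocf-shellfuncs may look different).
# usage: stop_processes_sketch "TERM KILL" <seconds-per-signal> pid...
stop_processes_sketch() {
    local signals="$1"
    local wait_time="$2"
    local sig pid i alive
    shift 2
    for sig in $signals; do
        # Send the current signal; ignore processes that already exited.
        for pid in "$@"; do
            kill -s "$sig" "$pid" 2>/dev/null || true
        done
        # Poll once per second until all PIDs are gone or time is up.
        i=0
        while :; do
            alive=0
            for pid in "$@"; do
                kill -0 "$pid" 2>/dev/null && alive=1
            done
            [ "$alive" -eq 0 ] && return 0
            [ "$i" -ge "$wait_time" ] && break
            sleep 1
            i=$((i + 1))
        done
    done
    # Some process survived every signal in the list.
    return 1
}
```

A caller could combine it with the PID list from the patch, e.g.
stop_processes_sketch "TERM KILL" 5 $(get_users_pids $mddev), and
treat a non-zero return as a failed stop.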

Cheers,

Dejan
