Mailing List Archive

RFC: pidfile handling; current worst case: stop failure and node level fencing
Recent discussions with Dejan made me again more prominently aware of a
few issues we probably all know about, but usually dismiss as not having
much relevance in the real world.

The facts:

* a pidfile typically only stores a pid
* a pidfile may be "stale", not properly cleaned up
  when the pid it references died.
* pids are recycled

This is more an issue if kernel.pid_max is small
wrt the number of processes created per unit time,
for example on some embedded systems,
or on some very busy systems.

But it may be an issue on any system,
even a mostly idle one, given "bad luck^W timing",
see below.

A common idiom in resource agents is to

kill_that_pid_and_wait_until_dead()
{
    local pid=$1
    is_alive $pid || return 0
    kill -TERM $pid
    while is_alive $pid ; do sleep 1; done
    return 0
}

The naïve implementation of is_alive() is
is_alive() { kill -0 $1 ; }

This is the main issue:
-----------------------

If the last-used-pid is just a bit smaller than $pid,
during the sleep 1, $pid may die,
and the OS may already have created a new process with that exact pid.

Using above "is_alive", kill_that_pid() will not notice that the
to-be-killed pid has actually terminated while that new process runs.
Which may be a very long time if that is some other long running daemon.

This may result in stop failure and resulting node level fencing.

The question is: what better way do we have to detect whether some pid
died after we killed it? Or, related, and even better: how to detect
whether the process currently running with some pid is in fact still the
process referenced by the pidfile?

I have two suggestions.

(I am trying to avoid bashisms in here.
But maybe I overlook some.
Also, the code is typed, not sourced from some working script,
so there may be logic bugs and typos.
My intent should be obvious enough, though.)

using "cd /proc/$pid; stat ."
-----------------------------

# this is most likely linux specific
kill_that_pid_and_wait_until_dead()
{
    local pid=$1
    (
        # we are in a subshell here, so leave it with exit
        cd /proc/$pid || exit 0
        kill -TERM $pid
        while stat . >/dev/null 2>&1 ; do sleep 1; done
    )
    return 0
}

Once pid dies, /proc/$pid will become stale (but not completely go away,
because it is our cwd), and stat . will return "No such process".

Variants:

using test -ef
--------------

exec 7</proc/$pid || return 0
kill -TERM $pid
while :; do
exec 8</proc/$pid || break
test /proc/self/fd/7 -ef /proc/self/fd/8 || break
sleep 1
done
exec 7<&- 8<&-

using stat -c %Y /proc/$pid
---------------------------

# note: %Y is the modification time (the change time would be %Z)
mtime0=$(stat -c %Y /proc/$pid)
kill -TERM $pid
while mtime=$(stat -c %Y /proc/$pid) && [ "$mtime" = "$mtime0" ] ; do sleep 1; done


Why not use the inode number I hear you say.
Because it is not stable. Sorry.
Don't believe me? Don't want to read kernel source?
Try it yourself:

sleep 120 & k=$!
stat /proc/$k
echo 3 > /proc/sys/vm/drop_caches
stat /proc/$k

But that leads me to another proposal:
store the starttime together with the pid in a pidfile.

For linux that would be:

(see proc(5) for /proc/pid/stat field meanings.
note that (comm) may contain both whitespace and ")",
which is the reason for my sed | cut below)

spawn_create_exclusive_pid_starttime()
{
    local pidfile=$1
    shift
    local reset
    # noclobber: creating the pidfile fails if it already exists
    case $- in *C*) reset=":";; *) set -C; reset="set +C";; esac
    if ! exec 3>"$pidfile" ; then
        $reset
        return 1
    fi

    $reset
    # double quotes inside, because the script itself is single-quoted
    setsid sh -c '
        read pid _ < /proc/self/stat
        starttime=$(sed -e "s/^.*) //" /proc/$pid/stat | cut -d" " -f 20)
        >&3 echo $pid $starttime
        3>&- exec "$@"
    ' -- "$@" &
    return 0
}

It does not seem possible to cycle through all available pids
within fractions of time smaller than the granularity of starttime,
so "pid starttime" should be a unique tuple (until the next reboot --
at least on linux, starttime is measured as strictly monotonic "uptime").
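
For reference on that granularity: proc(5) documents starttime in clock
ticks since boot. Something like the following shows the tick rate and
this shell's own starttime. It is only a rough sketch, Linux specific,
and the variable names are made up for illustration:

```sh
# Sketch only, Linux specific.
# starttime (field 22 of /proc/PID/stat) counts clock ticks since boot.
ticks_per_sec=$(getconf CLK_TCK)        # commonly 100
# strip everything up to the last ") " first, then starttime is field 20
my_starttime=$(sed -e 's/^.*) //' /proc/$$/stat | cut -d' ' -f 20)
echo "granularity: 1/$ticks_per_sec s"
echo "this shell started $(( my_starttime / ticks_per_sec )) s after boot"
```

With 100 ticks per second, a pid would have to be recycled within the
same 10 ms tick to end up with an identical "pid starttime" tuple.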


If we have "pid starttime" in the pidfile,
we can:

get_proc_pid_starttime()
{
    proc_pid_starttime=$(sed -e 's/^.*) //' /proc/$pid/stat) || return 1
    proc_pid_starttime=$(echo "$proc_pid_starttime" | cut -d' ' -f 20)
}

kill_using_pidfile()
{
    local pidfile=$1
    local pid starttime proc_pid_starttime

    test -e "$pidfile" || return # already dead
    read pid starttime <"$pidfile" || return # unreadable

    # check pid and starttime are both present, numeric only, ...
    # I have a version that distinguishes 16 distinct error
    # conditions; this is the short version only...

    local i=0
    while
        get_proc_pid_starttime &&
        [ "$starttime" = "$proc_pid_starttime" ]
    do
        : $(( i+=1 ))
        [ $i = 1 ] && kill -TERM $pid
        # MAYBE # [ $i = 30 ] && kill -KILL $pid
        sleep 1
    done

    # it's not (anymore) the process we were looking for
    # remove that pidfile.

    rm -f "$pidfile"
}
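
The same comparison could also back a status/monitor check. A sketch
under the same assumptions (Linux /proc, "pid starttime" on one line of
the pidfile). The function and variable names are made up here, and it
is typed for illustration, not lifted from a working agent:

```sh
# Sketch only; Linux specific (/proc). Names are hypothetical.

# Print the starttime (field 22 of /proc/PID/stat) of a pid; fail if gone.
proc_starttime()
{
    # strip everything up to the last ") " so that whitespace or ")"
    # inside (comm) cannot shift the fields; starttime is then field 20
    stat_line=$(sed -e 's/^.*) //' "/proc/$1/stat" 2>/dev/null) || return 1
    [ -n "$stat_line" ] || return 1
    echo "$stat_line" | cut -d' ' -f 20
}

# Return 0 only if the pidfile holds "pid starttime" and exactly that
# process is still running.
pidfile_is_current()
{
    [ -r "$1" ] || return 1
    read pf_pid pf_start < "$1" || [ -n "$pf_pid" ] || return 1
    # both values must be present and numeric only
    case $pf_pid in ''|*[!0-9]*) return 1;; esac
    case $pf_start in ''|*[!0-9]*) return 1;; esac
    [ "$(proc_starttime "$pf_pid")" = "$pf_start" ]
}
```

A monitor action could then call pidfile_is_current on the pidfile and
treat a non-zero result as "not running (or pidfile stale)".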

In other OSes, ps may be able to give a good enough equivalent?

Any comments?

Thanks,
Lars

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: RFC: pidfile handling; current worst case: stop failure and node level fencing
For the Assimilation code I use the full pathname of the binary from
/proc to tell if it's "one of mine". That's not perfect if you're using
an interpreted language. It works quite well for compiled languages.



Re: RFC: pidfile handling; current worst case: stop failure and node level fencing
On Mon, Oct 20, 2014 at 03:04:31PM -0600, Alan Robertson wrote:
> On 10/20/2014 02:52 PM, Alan Robertson wrote:
> > For the Assimilation code I use the full pathname of the binary from
> > /proc to tell if it's "one of mine". That's not perfect if you're using
> > an interpreted language. It works quite well for compiled languages.

It works just as well (or as bad) from interpreted languages:
readlink /proc/$pid/exe
(very old linux has a fsid:inode encoding there, but I digress)

But that does solve a different subset of problems,
has race conditions in itself, and breaks if you have updated the binary
since start of that service (which does happen).

It does not fully address what I am talking about.

> I somehow missed that you were talking about resource agents.

Not exclusively.
But any solution should easily work for any caller, with an easily
understood set of conventions, regardless of implementation language.

> But shouldn't the CRM guarantee that no more than one is running anyway
> - making some conditions a lot less likely?

You miss the point.

Point being:
kill -TERM $pid
while kill -0 $pid ; do sleep 1; done
may "never" terminate.

> If you care, the source for the code I mentioned is here:
> http://hg.linux-ha.org/assimilation/file/tip/clientlib/misc.c

Lars

Re: RFC: pidfile handling; current worst case: stop failure and node level fencing
On 10/20/2014 03:21 PM, Lars Ellenberg wrote:
> On Mon, Oct 20, 2014 at 03:04:31PM -0600, Alan Robertson wrote:
>> On 10/20/2014 02:52 PM, Alan Robertson wrote:
>>> For the Assimilation code I use the full pathname of the binary from
>>> /proc to tell if it's "one of mine". That's not perfect if you're using
>>> an interpreted language. It works quite well for compiled languages.
> It works just as well (or as bad) from interpreted languages:
> readlink /proc/$pid/exe
> (very old linux has a fsid:inode encoding there, but I digress)
>
> But that does solve a different subset of problems,
> has race conditions in itself, and breaks if you have updated the binary
> since start of that service (which does happen).
>
> It does not fully address what I am talking about.
It only breaks if you change the *name* of the binary. Updating the
binary contents has no effect. Changing the name of the binary is
pretty unusual - or so it seems to me. Did I miss something?

And if you do, you should stop with the binary with the old version and
start it with the new one. Very few methods are going to deal well with
radical changes in the service without stopping it with the old script,
updating, and starting with the new script.

I don't believe I see the race condition.

It won't loop, and it's not fooled by pid wraparound. What else are you
looking for? [Guess I missed something else here]

Re: RFC: pidfile handling; current worst case: stop failure and node level fencing
>On Mon, Oct 20, 2014 at 11:21:36PM +0200, Lars Ellenberg wrote:
>> On Mon, Oct 20, 2014 at 03:04:31PM -0600, Alan Robertson wrote:
>> > On 10/20/2014 02:52 PM, Alan Robertson wrote:
>> > > For the Assimilation code I use the full pathname of the binary from
>> > > /proc to tell if it's "one of mine". That's not perfect if you're using
>> > > an interpreted language. It works quite well for compiled languages.
>>
>> It works just as well (or as bad) from interpreted languages:
>> readlink /proc/$pid/exe
>> (very old linux has a fsid:inode encoding there, but I digress)
>>
>> But that does solve a different subset of problems,
>> has race conditions in itself, and breaks if you have updated the binary
>> since start of that service (which does happen).

Sorry, I lost the original.
Alan then wrote:

> It only breaks if you change the *name* of the binary. Updating the
> binary contents has no effect. Changing the name of the binary is
> pretty unusual - or so it seems to me. Did I miss something?
>
> And if you do, you should stop with the binary with the old version and
> start it with the new one. Very few methods are going to deal well with
> radical changes in the service without stopping it with the old script,
> updating, and starting with the new script.

Well, the "pid starttime" method does...

> I don't believe I see the race condition.

Does not matter.

> It won't loop, and it's not fooled by pid wraparound. What else are you
> looking for? [Guess I missed something else here]

pid + exe certainly is better than the pid alone.
It may even be "good enough".

But it still has shortcomings.

/proc/pid/exe is not stable
(" (deleted)" gets appended if the binary is deleted);
that could be accounted for.

/proc/pid/exe links to the interpreter (python, bash, java, whatever)

Even if it is a "real" binary, (pid, /proc/pid/exe) is
still NOT unique for pid re-use after wrap around:
think different instances of mysql or whatever.
(yes, it gets increasingly unlikely...)
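
For illustration, a pid + exe check that accounts for the " (deleted)"
suffix might look roughly like this. It is a sketch only, Linux
specific, with a made-up function name, and it shares the other
shortcomings listed above:

```sh
# Sketch: does PID currently run the executable EXE? (Linux /proc only)
# usage: pid_runs_exe PID /absolute/path/to/binary
pid_runs_exe()
{
    exe=$(readlink "/proc/$1/exe" 2>/dev/null) || return 1
    # once the binary was removed or replaced, the link target
    # has " (deleted)" appended; strip that before comparing
    exe=${exe% (deleted)}
    [ "$exe" = "$2" ]
}
```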

However, (pid, starttime) *is* unique (for the lifetime of the pidfile,
as long as that is stored on tmpfs resp. cleared after reboot).
(unless you tell me you can eat through pid_max, or at least the
currently unused pids, within the granularity of starttime...)

So that's why I propose to use (pid, starttime) tuple.

If you see problems with (pid, starttime), please speak up.
If you have something *better*, please speak up.
If you just have something "different",
feel free to tell us anyways :-)

Lars

Re: RFC: pidfile handling; current worst case: stop failure and node level fencing
On 10/21/2014 2:29 AM, Lars Ellenberg wrote:
>> On Mon, Oct 20, 2014 at 11:21:36PM +0200, Lars Ellenberg wrote:
>>> On Mon, Oct 20, 2014 at 03:04:31PM -0600, Alan Robertson wrote:
>>>> On 10/20/2014 02:52 PM, Alan Robertson wrote:
>>>>> For the Assimilation code I use the full pathname of the binary from
>>>>> /proc to tell if it's "one of mine". That's not perfect if you're using
>>>>> an interpreted language. It works quite well for compiled languages.
>>> It works just as well (or as bad) from interpreted languages:
>>> readlink /proc/$pid/exe
>>> (very old linux has a fsid:inode encoding there, but I digress)
>>>
>>> But that does solve a different subset of problems,
>>> has race conditions in itself, and breaks if you have updated the binary
>>> since start of that service (which does happen).
> Sorry, I lost the original.
> Alan then wrote:
>
>> It only breaks if you change the *name* of the binary. Updating the
>> binary contents has no effect. Changing the name of the binary is
>> pretty unusual - or so it seems to me. Did I miss something?
>>
>> And if you do, you should stop with the binary with the old version and
>> start it with the new one. Very few methods are going to deal well with
>> radical changes in the service without stopping it with the old script,
>> updating, and starting with the new script.
> Well, the "pid starttime" method does...
>
>> I don't believe I see the race condition.
> Does not matter.
>
>> It won't loop, and it's not fooled by pid wraparound. What else are you
>> looking for? [Guess I missed something else here]
> pid + exe is certainly is better than the pid alone.
> It may even be "good enough".
>
> But it still has shortcomings.
>
> /proc/pid/exe is not stable,
> (changes to "deleted" if the binary is deleted)
> could be accounted for.
>
> /proc/pid/exe links to the interpreter (python, bash, java, whatever)
>
> Even if it is a "real" binary, (pid, /proc/pid/exe) is
> still NOT unique for pid re-use after wrap around:
> think different instances of mysql or whatever.
> (yes, it gets increasingly unlikely...)
For most cases, a persistent daemon is a compiled language. Of course
not all, but all the ones I personally care about ;-)
>
> However, (pid, starttime) *is* unique (for the lifetime of the pidfile,
> as long as that is stored on tmpfs resp. cleared after reboot).
> (unless you tell me you can eat through pid_max, or at least the
> currently unused pids, within the granularity of starttime...)
>
> So that's why I propose to use (pid, starttime) tuple.
>
> If you see problems with (pid, starttime), please speak up.
> If you have something *better*, please speak up.
> If you just have something "different",
> feel free to tell us anyways :-)

The contents of the pidfile are specified by the LSB (or at least they
were at some time in the past). That's why I use just the pid. The
current version specifies that the first line of a pidfile consists of
one or more numbers, and any subsequent lines should be ignored. If you
go the way you do, I'd suggest other data be put on separate lines.

You might compare what you're doing to
http://refspecs.linuxbase.org/LSB_3.1.1/LSB-Core-generic/LSB-Core-generic/iniscrptfunc.html

Instead of storing the start time explicitly, you could touch the pid
file's creation time to match that of the process ;-) That's harder to
do in the shell, unfortunately...
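
A sketch of the "separate lines" layout, assuming the pid goes on the
first line and the starttime on the second. The function and variable
names are hypothetical, and nothing here is taken from the LSB text
itself:

```sh
# Sketch, assuming a two-line layout: line 1 = pid, line 2 = starttime.

# usage: write_pidfile_lsb FILE PID STARTTIME
write_pidfile_lsb()
{
    printf '%s\n%s\n' "$2" "$3" > "$1"
}

# usage: read_pidfile_lsb FILE   (sets pf_pid and pf_start)
read_pidfile_lsb()
{
    { read pf_pid && read pf_start; } < "$1"
}
```

A reader that only does "read pid < file" still gets just the pid from
the first line and never sees the extra data.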

-- Alan

Re: RFC: pidfile handling; current worst case: stop failure and node level fencing
On 20/10/14 20:17, Lars Ellenberg wrote:
> In other OSes, ps may be able to give a good enough equivalent?

Debian's start-stop-daemon executable might be worth considering here -
it's used extensively in the init script infrastructure of Debian (and
derivatives, over several different OS kernels), and so is well
debugged, and in my experience beats re-implementing its functionality.

http://anonscm.debian.org/cgit/dpkg/dpkg.git/tree/utils/start-stop-daemon.c

I've used it in pacemaker resource control scripts before successfully -
its kill expression support is very useful in particular on HA.

Tim.


NAME
    start-stop-daemon - start and stop system daemon programs

SYNOPSIS
    start-stop-daemon [option...] command

DESCRIPTION
    start-stop-daemon is used to control the creation and termination of
    system-level processes. Using one of the matching options,
    start-stop-daemon can be configured to find existing instances of a
    running process.

    Note: unless --pid or --pidfile are specified, start-stop-daemon
    behaves similar to killall(1). start-stop-daemon will scan the process
    table looking for any processes which match the process name, parent
    pid, uid, and/or gid (if specified). Any matching process will prevent
    --start from starting the daemon. All matching processes will be sent
    the TERM signal (or the one specified via --signal or --retry) if
    --stop is specified. For daemons which have long-lived children which
    need to live through a --stop, you must specify a pidfile.

COMMANDS
    -S, --start [--] arguments
        Check for the existence of a specified process. If such a
        process exists, start-stop-daemon does nothing, and exits with
        error status 1 (0 if --oknodo is specified). If such a process
        does not exist, it starts an instance, using either the
        executable specified by --exec or, if specified, by --startas.
        Any arguments given after -- on the command line are passed
        unmodified to the program being started.

    -K, --stop
        Checks for the existence of a specified process. If such a
        process exists, start-stop-daemon sends it the signal specified
        by --signal, and exits with error status 0. If such a process
        does not exist, start-stop-daemon exits with error status 1 (0
        if --oknodo is specified). If --retry is specified, then
        start-stop-daemon will check that the process(es) have
        terminated.

    -T, --status
        Check for the existence of a specified process, and returns an
        exit status code, according to the LSB Init Script Actions.

    -H, --help
        Show usage information and exit.

    -V, --version
        Show the program version and exit.

OPTIONS
  Matching options
    --pid pid
        Check for a process with the specified pid. The pid must be a
        number greater than 0.

    --ppid ppid
        Check for a process with the specified ppid (parent pid). The
        ppid must be a number greater than 0.

    -p, --pidfile pid-file
        Check whether a process has created the file pid-file. Note:
        using this matching option alone might cause unintended
        processes to be acted on, if the old process terminated without
        being able to remove the pid-file.

    -x, --exec executable
        Check for processes that are instances of this executable. The
        executable argument should be an absolute pathname. Note: this
        might not work as intended with interpreted scripts, as the
        executable will point to the interpreter. Take into account
        processes running from inside a chroot will also be matched, so
        other match restrictions might be needed.

    -n, --name process-name
        Check for processes with the name process-name. The process-name
        is usually the process filename, but it could have been changed
        by the process itself. Note: on most systems this information is
        retrieved from the process comm name from the kernel, which
        tends to have a relatively short length limit (assuming more
        than 15 characters is non-portable).

    -u, --user username|uid
        Check for processes owned by the user specified by username or
        uid. Note: using this matching option alone will cause all
        processes matching the user to be acted on.

  Generic options
    -g, --group group|gid
        Change to group or gid when starting the process.

    -s, --signal signal
        With --stop, specifies the signal to send to processes being
        stopped (default TERM).

    -R, --retry timeout|schedule
        With --stop, specifies that start-stop-daemon is to check
        whether the process(es) do finish. It will check repeatedly
        whether any matching processes are running, until none are. If
        the processes do not exit it will then take further action as
        determined by the schedule.

        If timeout is specified instead of schedule, then the schedule
        signal/timeout/KILL/timeout is used, where signal is the signal
        specified with --signal.

        schedule is a list of at least two items separated by slashes
        (/); each item may be -signal-number or [-]signal-name, which
        means to send that signal, or timeout, which means to wait that
        many seconds for processes to exit, or forever, which means to
        repeat the rest of the schedule forever if necessary.

        If the end of the schedule is reached and forever is not
        specified, then start-stop-daemon exits with error status 2. If
        a schedule is specified, then any signal specified with --signal
        is ignored.

    -a, --startas pathname
        With --start, start the process specified by pathname. If not
        specified, defaults to the argument given to --exec.

    -t, --test
        Print actions that would be taken and set appropriate return
        value, but take no action.

    -o, --oknodo
        Return exit status 0 instead of 1 if no actions are (would be)
        taken.

    -q, --quiet
        Do not print informational messages; only display error
        messages.

    -c, --chuid username|uid[:group|gid]
        Change to this username/uid before starting the process. You can
        also specify a group by appending a :, then the group or gid in
        the same way as you would for the `chown' command (user:group).
        If a user is specified without a group, the primary GID for that
        user is used. When using this option you must realize that the
        primary and supplemental groups are set as well, even if the
        --group option is not specified. The --group option is only for
        groups that the user isn't normally a member of (like adding per
        process group membership for generic users like nobody).

    -r, --chroot root
        Chdir and chroot to root before starting the process. Please
        note that the pidfile is also written after the chroot.

    -d, --chdir path
        Chdir to path before starting the process. This is done after
        the chroot if the -r|--chroot option is set. When not specified,
        start-stop-daemon will chdir to the root directory before
        starting the process.

    -b, --background
        Typically used with programs that don't detach on their own.
        This option will force start-stop-daemon to fork before starting
        the process, and force it into the background. WARNING:
        start-stop-daemon cannot check the exit status if the process
        fails to execute for any reason. This is a last resort, and is
        only meant for programs that either make no sense forking on
        their own, or where it's not feasible to add the code for them
        to do this themselves.

    -C, --no-close
        Do not close any file descriptor when forcing the daemon into
        the background. Used for debugging purposes to see the process
        output, or to redirect file descriptors to log the process
        output. Only relevant when using --background.

    -N, --nicelevel int
        This alters the priority of the process before starting it.

    -P, --procsched policy:priority
        This alters the process scheduler policy and priority of the
        process before starting it. The priority can be optionally
        specified by appending a : followed by the value. The default
        priority is 0. The currently supported policy values are other,
        fifo and rr.

    -I, --iosched class:priority
        This alters the IO scheduler class and priority of the process
        before starting it. The priority can be optionally specified by
        appending a : followed by the value. The default priority is 4,
        unless class is idle, then priority will always be 7. The
        currently supported values for class are idle, best-effort and
        real-time.

    -k, --umask mask
        This sets the umask of the process before starting it.

    -m, --make-pidfile
        Used when starting a program that does not create its own pid
        file. This option will make start-stop-daemon create the file
        referenced with --pidfile and place the pid into it just before
        executing the process. Note, the file will only be removed when
        stopping the program if --remove-pidfile is used. NOTE: This
        feature may not work in all cases. Most notably when the program
        being executed forks from its main process. Because of this, it
        is usually only useful when combined with the --background
        option.

    --remove-pidfile
        Used when stopping a program that does not remove its own pid
        file. This option will make start-stop-daemon remove the file
        referenced with --pidfile after terminating the process.

    -v, --verbose
        Print verbose informational messages.

EXIT STATUS
    0   The requested action was performed. If --oknodo was specified,
        it's also possible that nothing had to be done. This can happen
        when --start was specified and a matching process was already
        running, or when --stop was specified and there were no matching
        processes.

    1   If --oknodo was not specified and nothing was done.

    2   If --stop and --retry were specified, but the end of the
        schedule was reached and the processes were still running.

    3   Any other error.

    When using the --status command, the following status codes are
    returned:

    0   Program is running.

    1   Program is not running and the pid file exists.

    3   Program is not running.

    4   Unable to determine program status.

EXAMPLE
    Start the food daemon, unless one is already running (a process named
    food, running as user food, with pid in food.pid):

        start-stop-daemon --start --oknodo --user food --name food \
            --pidfile /run/food.pid --startas /usr/sbin/food \
            --chuid food -- --daemon

    Send SIGTERM to food and wait up to 5 seconds for it to stop:

        start-stop-daemon --stop --oknodo --user food --name food \
            --pidfile /run/food.pid --retry 5

    Demonstration of a custom schedule for stopping food:

        start-stop-daemon --stop --oknodo --user food --name food \
            --pidfile /run/food.pid --retry=TERM/30/KILL/5

Debian Project                2014-03-26            start-stop-daemon(8)



--
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53 http://seoss.co.uk/ +44-(0)1273-808309

Re: RFC: pidfile handling; current worst case: stop failure and node level fencing
Hi Lars,

On Mon, Oct 20, 2014 at 09:17:29PM +0200, Lars Ellenberg wrote:
>
> Recent discussions with Dejan made me again more prominently aware of a
> few issues we probably all know about, but usually dismiss as having not
> much relevance in the real-world.
>
> The facts:
>
> * a pidfile typically only stores a pid
> * a pidfile may "stale", not properly cleaned up
> when the pid it references died.
> * pids are recycled
>
> This is more an issue if kernel.pid_max is small
> wrt the number of processes created per unit time,
> for example on some embedded systems,
> or on some very busy systems.
>
> But it may be an issue on any system,
> even a mostly idle one, given "bad luck^W timing",
> see below.
>
> A common idiom in resource agents is to
>
> kill_that_pid_and_wait_until_dead()
> {
> local pid=$1
> is_alive $pid || return 0
> kill -TERM $pid
> while is_alive $pid ; do sleep 1; done
> return 0
> }
>
> The naïve implementation of is_alive() is
> is_alive() { kill -0 $1 ; }
>
> This is the main issue:
> -----------------------
>
> If the last-used-pid is just a bit smaller then $pid,
> during the sleep 1, $pid may die,
> and the OS may already have created a new process with that exact pid.
>
> Using above "is_alive", kill_that_pid() will not notice that the
> to-be-killed pid has actually terminated while that new process runs.
> Which may be a very long time if that is some other long running daemon.
>
> This may result in stop failure and resulting node level fencing.
>
> The question is, which better way do we have to detect if some pid died
> after we killed it. Or, related, and even better: how to detect if the
> process currently running with some pid is in fact still the process
> referenced by the pidfile.
>
> I have two suggestions.
>
> (I am trying to avoid bashisms in here.
> But maybe I overlook some.
> Also, the code is typed, not sourced from some working script,
> so there may be logic bugs and typos.
> My intent should be obvious enough, though.)
>
> using "cd /proc/$pid; stat ."
> -----------------------------
>
> # this is most likely linux specific

Apparently not. According to Wikipedia at least, most UNIX
platforms (including BSD and Solaris) support /proc/$pid.

> kill_that_pid_and_wait_until_dead()
> {
> local pid=$1
> (
> cd /proc/$pid || return 0
> kill -TERM $pid
> while stat . ; do sleep 1; done

I'd rather "test -d ." (it's more common in shell scripts and
runs faster). BTW, on my laptop, test -d is so fast that the
process doesn't get removed before it runs and the while loop
always gets executed. In that respect, "stat" or "ls -d" performs
better.

> )
> return 0
> }
>
> Once pid dies, /proc/$pid will become stale (but not completely go away,
> because it is our cwd), and stat . will return "No such process".

This seems to be a very elegant solution and I cannot find fault
with it. Short and easy to understand too.

[.... Skipping other proposals, some of which are quite exotic :) ]

> kill_using_pidfile()
> {
> local pidfile=$1
> local pid starttime proc_pid_starttime
>
> test -e $pidfile || return # already dead
> read pid starttime <$pidfile || return # unreadable

I'd assume that we (the caller) know what the process should
look like in the process table, say its command and arguments.
We could also test for that, in case the process is gone
but the PID file somehow stayed behind.

> # check pid and starttime are both present, numeric only, ...
> # I have a version that distinguishes 16 distinct error

Wow!

> # conditions; this is the short version only...
>
> local i=0
> while
> get_proc_pid_starttime &&
> [ "$starttime" = "$proc_pid_starttime" ]
> do
> : $(( i+=1 ))
> [ $i = 1 ] && kill -TERM $pid
> # MAYBE # [ $i = 30 ] && kill -KILL $pid
> sleep 1
> done
>
> # it's not (anymore) the process we where looking for
> # remove that pidfile.
>
> rm -f "$pidfile"
> }
>
> In other OSes, ps may be able to give a good enough equivalent?
>
> Any comments?

I'd just go with the "cd /proc/$pid" thing. Perhaps add a test
for "ps -o cmd $pid" output.

And thanks for giving this such a thorough analysis!

Thanks,

Dejan

> Thanks,
> Lars
>
Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
Hi Alan,

On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
> For the Assimilation code I use the full pathname of the binary from
> /proc to tell if it's "one of mine". That's not perfect if you're using
> an interpreted language. It works quite well for compiled languages.

Yes, though not perfect, that may be good enough. I suppose that
the probability that the very same program gets the same recycled
pid is rather low. (Or is it?)
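
A minimal sketch of the "full pathname of the binary" check in shell,
assuming Linux and a readable /proc/<pid>/exe symlink. The function
name pid_runs_binary is made up for illustration. As noted above, for
interpreted programs the link names the interpreter rather than the
script, and if the binary was replaced on disk the link target carries
a " (deleted)" suffix and will not match.

```shell
# Hypothetical helper: succeed only if pid $1 is currently running the
# binary whose full path is $2. Linux-specific (/proc/<pid>/exe).
pid_runs_binary()
{
    local pid="$1" expected="$2" exe
    # readlink fails if the pid is gone or we are not allowed to look
    exe=$(readlink "/proc/$pid/exe" 2>/dev/null) || return 1
    [ "$exe" = "$expected" ]
}

# Example: pid_runs_binary "$pid" /usr/sbin/food || echo "not ours"
```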

Cheers,

Dejan

>
> On 10/20/2014 01:17 PM, Lars Ellenberg wrote:
> > Recent discussions with Dejan made me again more prominently aware of a
> > few issues we probably all know about, but usually dismis as having not
> > much relevance in the real-world.
> >
> > The facts:
> >
> > * a pidfile typically only stores a pid
> > * a pidfile may "stale", not properly cleaned up
> > when the pid it references died.
> > * pids are recycled
> >
> > This is more an issue if kernel.pid_max is small
> > wrt the number of processes created per unit time,
> > for example on some embeded systems,
> > or on some very busy systems.
> >
> > But it may be an issue on any system,
> > even a mostly idle one, given "bad luck^W timing",
> > see below.
> >
> > A common idiom in resource agents is to
> >
> > kill_that_pid_and_wait_until_dead()
> > {
> > local pid=$1
> > is_alive $pid || return 0
> > kill -TERM $pid
> > while is_alive $pid ; do sleep 1; done
> > return 0
> > }
> >
> > The naïve implementation of is_alive() is
> > is_alive() { kill -0 $1 ; }
> >
> > This is the main issue:
> > -----------------------
> >
> > If the last-used-pid is just a bit smaller then $pid,
> > during the sleep 1, $pid may die,
> > and the OS may already have created a new process with that exact pid.
> >
> > Using above "is_alive", kill_that_pid() will not notice that the
> > to-be-killed pid has actually terminated while that new process runs.
> > Which may be a very long time if that is some other long running daemon.
> >
> > This may result in stop failure and resulting node level fencing.
> >
> > The question is, which better way do we have to detect if some pid died
> > after we killed it. Or, related, and even better: how to detect if the
> > process currently running with some pid is in fact still the process
> > referenced by the pidfile.
> >
> > I have two suggestions.
> >
> > (I am trying to avoid bashisms in here.
> > But maybe I overlook some.
> > Also, the code is typed, not sourced from some working script,
> > so there may be logic bugs and typos.
> > My intent should be obvious enough, though.)
> >
> > using "cd /proc/$pid; stat ."
> > -----------------------------
> >
> > # this is most likely linux specific
> > kill_that_pid_and_wait_until_dead()
> > {
> > local pid=$1
> > (
> > cd /proc/$pid || return 0
> > kill -TERM $pid
> > while stat . ; do sleep 1; done
> > )
> > return 0
> > }
> >
> > Once pid dies, /proc/$pid will become stale (but not completely go away,
> > because it is our cwd), and stat . will return "No such process".
> >
> > Variants:
> >
> > using test -ef
> > --------------
> >
> > exec 7</proc/$pid || return 0
> > kill -TERM $pid
> > while :; do
> > exec 8</proc/$pid || break
> > test /proc/self/fd/7 -ef /proc/self/fd/8 || break
> > sleep 1
> > done
> > exec 7<&- 8<&-
> >
> > using stat -c %Y /proc/$pid
> > ---------------------------
> >
> > ctime0=$(stat -c %Y /proc/$pid)
> > kill -TERM $pid
> > while ctime=$(stat -c %Y /proc/$pid) && [ $ctime = $ctime0 ] ; do sleep 1; done
> >
> >
> > Why not use the inode number I hear you say.
> > Because it is not stable. Sorry.
> > Don't believe me? Don't want to read kernel source?
> > Try it yourself:
> >
> > sleep 120 & k=$!
> > stat /proc/$k
> > echo 3 > /proc/sys/vm/drop_caches
> > stat /proc/$k
> >
> > But that leads me to an other proposal:
> > store the starttime together with the pid in a pidfile.
> >
> > For linux that would be:
> >
> > (see proc(5) for /proc/pid/stat field meanings.
> > note that (comm) may contain both whitespace and ")",
> > which is the reason for my sed | cut below)
> >
> > spawn_create_exclusive_pid_starttime()
> > {
> > local pidfile=$1
> > shift
> > local reset
> > case $- in *C*) reset=":";; *) set -C; reset="set +C";; esac
> > if ! exec 3>$pidfile ; then
> > $reset
> > return 1
> > fi
> >
> > $reset
> > setsid sh -c '
> > read pid _ < /proc/self/stat
> > starttime=$(sed -e "s/^.*) //" /proc/$pid/stat | cut -d" " -f 20)
> > >&3 echo $pid $starttime
> > 3>&- exec "$@"
> > ' -- "$@" &
> > return 0
> > }
> >
> > It does not seem possible to cycle through all available pids
> > within fractions of time smaller than the granularity of starttime,
> > so "pid starttime" should be a unique tuple (until the next reboot --
> > at least on linux, starttime is measured as strictly monotonic "uptime").
> >
> >
> > If we have "pid starttime" in the pidfile,
> > we can:
> >
> > get_proc_pid_starttime()
> > {
> > proc_pid_starttime=$(sed -e 's/^.*) //' /proc/$pid/stat) || return 1
> > proc_pid_starttime=$(echo "$proc_pid_starttime" | cut -d' ' -f 20)
> > }
> >
> > kill_using_pidfile()
> > {
> > local pidfile=$1
> > local pid starttime proc_pid_starttime
> >
> > test -e $pidfile || return # already dead
> > read pid starttime <$pidfile || return # unreadable
> >
> > # check pid and starttime are both present, numeric only, ...
> > # I have a version that distinguishes 16 distinct error
> > # conditions; this is the short version only...
> >
> > local i=0
> > while
> > get_proc_pid_starttime &&
> > [ "$starttime" = "$proc_pid_starttime" ]
> > do
> > : $(( i+=1 ))
> > [ $i = 1 ] && kill -TERM $pid
> > # MAYBE # [ $i = 30 ] && kill -KILL $pid
> > sleep 1
> > done
> >
> > # it's not (anymore) the process we where looking for
> > # remove that pidfile.
> >
> > rm -f "$pidfile"
> > }
> >
> > In other OSes, ps may be able to give a good enough equivalent?
> >
> > Any comments?
> >
> > Thanks,
> > Lars
> >
Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On 10/22/2014 03:33 AM, Dejan Muhamedagic wrote:
> Hi Alan,
>
> On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
>> For the Assimilation code I use the full pathname of the binary from
>> /proc to tell if it's "one of mine". That's not perfect if you're using
>> an interpreted language. It works quite well for compiled languages.
> Yes, though not perfect, that may be good enough. I supposed that
> the probability that the very same program gets the same recycled
> pid is rather low. (Or is it?)
From my 'C' code I could touch the lock file to match the timestamp of
the /proc/pid/stat (or /proc/pid/exe) symlink -- and verify that they
match. If there is no /proc/pid/stat, then you won't get that extra
safeguard. But as you suggest, it decreases the probability by orders
of magnitude even without the

The /proc/pid/exe symlink appears to have the same timestamp as
/proc/pid/stat

Does anyone know which OSes have either or both of those /proc names?

-- AlanRobertson
alanr@unix.sh


Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On Wed, Oct 22, 2014 at 06:50:37AM -0600, Alan Robertson wrote:
> On 10/22/2014 03:33 AM, Dejan Muhamedagic wrote:
> > Hi Alan,
> >
> > On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
> >> For the Assimilation code I use the full pathname of the binary from
> >> /proc to tell if it's "one of mine". That's not perfect if you're using
> >> an interpreted language. It works quite well for compiled languages.
> > Yes, though not perfect, that may be good enough. I supposed that
> > the probability that the very same program gets the same recycled
> > pid is rather low. (Or is it?)
> From my 'C' code I could touch the lock file to match the timestamp of
> the /proc/pid/stat (or /proc/pid/exe) symlink -- and verify that they
> match. If there is no /proc/pid/stat, then you won't get that extra
> safeguard. But as you suggest, it decreases the probability by orders
> of magnitude even without the
>
> The /proc/pid/exe symlink appears to have the same timestamp as
> /proc/pid/stat

Hmm, not here:

$ sudo ls -lt /proc/1
...
lrwxrwxrwx 1 root root 0 Aug 27 13:51 exe -> /sbin/init
dr-x------ 2 root root 0 Aug 27 13:51 fd
-r--r--r-- 1 root root 0 Aug 27 13:20 cmdline
-r--r--r-- 1 root root 0 Aug 27 13:18 stat

And the process (init) has been running since July:

$ ps auxw | grep -w [i]nit
root 1 0.0 0.0 10540 780 ? Ss Jul07 1:03 init [3]

Interesting.

> Does anyone know which OSes have either or both of those /proc names?

Nope, not me.

Cheers,

Dejan

> -- AlanRobertson
> alanr@unix.sh
>
>
Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On 22/10/14 13:50, Alan Robertson wrote:
> Does anyone know which OSes have either or both of those /proc names?

Once again, can I recommend taking a look at the start-stop-daemon
source (see earlier posting), which does this stuff, and includes checks
for Linux/Hurd/Sun/OpenBSD/FreeBSD/NetBSD/DragonFly, and whilst I've
only ever used it on Linux, at the very least the BSD side seems to be
maintained:

http://anonscm.debian.org/cgit/dpkg/dpkg.git/tree/utils/start-stop-daemon.c

Tim.

--
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53 http://seoss.co.uk/ +44-(0)1273-808309

Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On 10/22/2014 07:09 AM, Dejan Muhamedagic wrote:
> On Wed, Oct 22, 2014 at 06:50:37AM -0600, Alan Robertson wrote:
>> On 10/22/2014 03:33 AM, Dejan Muhamedagic wrote:
>>> Hi Alan,
>>>
>>> On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
>>>> For the Assimilation code I use the full pathname of the binary from
>>>> /proc to tell if it's "one of mine". That's not perfect if you're using
>>>> an interpreted language. It works quite well for compiled languages.
>>> Yes, though not perfect, that may be good enough. I supposed that
>>> the probability that the very same program gets the same recycled
>>> pid is rather low. (Or is it?)
>> From my 'C' code I could touch the lock file to match the timestamp of
>> the /proc/pid/stat (or /proc/pid/exe) symlink -- and verify that they
>> match. If there is no /proc/pid/stat, then you won't get that extra
>> safeguard. But as you suggest, it decreases the probability by orders
>> of magnitude even without the
>>
>> The /proc/pid/exe symlink appears to have the same timestamp as
>> /proc/pid/stat
> Hmm, not here:
>
> $ sudo ls -lt /proc/1
> ...
> lrwxrwxrwx 1 root root 0 Aug 27 13:51 exe -> /sbin/init
> dr-x------ 2 root root 0 Aug 27 13:51 fd
> -r--r--r-- 1 root root 0 Aug 27 13:20 cmdline
> -r--r--r-- 1 root root 0 Aug 27 13:18 stat
>
> And the process (init) has been running since July:
>
> $ ps auxw | grep -w [i]nit
> root 1 0.0 0.0 10540 780 ? Ss Jul07 1:03 init [3]
>
> Interesting.
And a little worrisome for these strategies...

Here is what I see for timestamps that look to be about the time of
system boot:

-r-------- 1 root root 0 Oct 21 15:42 environ
lrwxrwxrwx 1 root root 0 Oct 21 15:42 root -> /
-r--r--r-- 1 root root 0 Oct 21 15:42 limits
dr-x------ 2 root root 0 Oct 21 15:42 fd
lrwxrwxrwx 1 root root 0 Oct 21 15:42 exe -> /sbin/init
-r--r--r-- 1 root root 0 Oct 21 15:42 stat
-r--r--r-- 1 root root 0 Oct 21 15:42 cgroup
-r--r--r-- 1 root root 0 Oct 21 15:42 cmdline

servidor:/proc/1 $ ls -l /var/log/boot.log
-rw-r--r-- 1 root root 5746 Oct 21 15:42 /var/log/boot.log

servidor:/proc/1 $ ls -ld .
dr-xr-xr-x 9 root root 0 Oct 21 15:42 .

So, you can open file descriptors (fd), change your environment and
cmdline and (soft) limits. You can't change your exe, or root. Cgroup
is new, and I suspect you can't change it. I suspect that the directory
timestamp (/proc/<pid>/) won't change either.

I wonder if it will change on BSD or Solaris or AIX.

/proc info for AIX:

http://www-01.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.files/proc.htm
It doesn't say anything about file timestamps.
Solaris info is here:
http://docs.oracle.com/cd/E23824_01/html/821-1473/proc-4.html#scrolltoc
It also doesn't mention timestamps.
FreeBSD is here:
http://www.unix.com/man-page/freebsd/5/procfs/
Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On 10/22/2014 07:11 AM, Tim Small wrote:
> On 22/10/14 13:50, Alan Robertson wrote:
>> Does anyone know which OSes have either or both of those /proc names?
> Once again, can I recommend taking a look at the start-stop-daemon
> source (see earlier posting), which does this stuff, and includes checks
> for Linux/Hurd/Sun/OpenBSD/FreeBSD/NetBSD/DragonFly, and whilst I've
> only ever used it on Linux, at the very least the BSD side seems to be
> maintained:
>
> http://anonscm.debian.org/cgit/dpkg/dpkg.git/tree/utils/start-stop-daemon.c
According to how you described it earlier, it didn't seem to solve the
problems described in this thread. At best it does pretty much exactly
what my previously-implemented solution does.

This discussion has been a bit esoteric. Although my method (and also
start-stop-daemon) are highly unlikely to err, they can make mistakes in
some circumstances.

-- Alan Robertson
alanr@unix.sh
Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On Tue, Oct 21, 2014 at 02:06:24PM +0100, Tim Small wrote:
> On 20/10/14 20:17, Lars Ellenberg wrote:
> > In other OSes, ps may be able to give a good enough equivalent?
>
> Debian's start-stop-daemon executable might be worth considering here -
> it's used extensively in the init script infrastructure of Debian (and
> derivatives, over several different OS kernels), and so is well
> debugged, and in my experience beats re-implementing it's functionality.
>
> http://anonscm.debian.org/cgit/dpkg/dpkg.git/tree/utils/start-stop-daemon.c
>
> I've used it in pacemaker resource control scripts before successfully -
> it's kill expression support is very useful in particular on HA.
>
> Tim.
>
>
> NAME
>
> start-stop-daemon - start and stop system daemon programs

Really? pasting a man page to a mailing list?

But yes...

If we want to require presence of start-stop-daemon,
we could make all this somebody else's problem.
I need to find some time to browse through the code
to see if it can be improved further.
But in any case, using (a tool like) start-stop-daemon consistently
throughout all RAs would improve the situation already.

Do we want to do that?
Dejan? David? Anyone?

Lars
Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On Wed, Oct 22, 2014 at 03:09:12PM +0200, Dejan Muhamedagic wrote:
> On Wed, Oct 22, 2014 at 06:50:37AM -0600, Alan Robertson wrote:
> > On 10/22/2014 03:33 AM, Dejan Muhamedagic wrote:
> > > Hi Alan,
> > >
> > > On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
> > >> For the Assimilation code I use the full pathname of the binary from
> > >> /proc to tell if it's "one of mine". That's not perfect if you're using
> > >> an interpreted language. It works quite well for compiled languages.
> > > Yes, though not perfect, that may be good enough. I supposed that
> > > the probability that the very same program gets the same recycled
> > > pid is rather low. (Or is it?)
> > From my 'C' code I could touch the lock file to match the timestamp of
> > the /proc/pid/stat (or /proc/pid/exe) symlink -- and verify that they
> > match. If there is no /proc/pid/stat, then you won't get that extra
> > safeguard. But as you suggest, it decreases the probability by orders
> > of magnitude even without the
> >
> > The /proc/pid/exe symlink appears to have the same timestamp as
> > /proc/pid/stat
>
> Hmm, not here:
>
> $ sudo ls -lt /proc/1
> ...
> lrwxrwxrwx 1 root root 0 Aug 27 13:51 exe -> /sbin/init
> dr-x------ 2 root root 0 Aug 27 13:51 fd
> -r--r--r-- 1 root root 0 Aug 27 13:20 cmdline
> -r--r--r-- 1 root root 0 Aug 27 13:18 stat


We cannot rely on properties of the inodes in /proc/.

These inodes get dropped and recreated as the system sees fit,
and their properties are re-initialized to "something".
Ok, the uid/gid is consistent, obviously.
But neither the inode numbers nor a,m,ctime are "stable".

I demo'ed that in my first email,
I demo it again here:

sleep 120 & k=$! ; stat /proc/$k ; echo 3 > /proc/sys/vm/drop_caches ; sleep 2; find /proc/ -ls &> /dev/null; stat /proc/$k

File: `/proc/8862'
Size: 0 Blocks: 0 IO Block: 1024 directory
Device: 3h/3d Inode: 4295899 Links: 8
Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2014-10-23 18:43:25.535000006 +0200
Modify: 2014-10-23 18:43:25.535000006 +0200
Change: 2014-10-23 18:43:25.535000006 +0200

File: `/proc/8862'
Size: 0 Blocks: 0 IO Block: 1024 directory
Device: 3h/3d Inode: 4296016 Links: 8
Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2014-10-23 18:43:27.561002753 +0200
Modify: 2014-10-23 18:43:27.561002753 +0200
Change: 2014-10-23 18:43:27.561002753 +0200


Note how the inode number and a,m,ctime change.

the "starttime" I was talking about is the 22nd field of /proc/$pid/stat
see proc(5):
starttime %llu (was %lu before Linux 2.6)
(22) The time the process started after system boot.
In kernels before Linux 2.6, this value was expressed in jiffies.
Since Linux 2.6, the value is expressed in clock ticks
(divide by sysconf(_SC_CLK_TCK)).

That's a monotonic time counting from system boot.
Which makes it so attractive.
Even if someone fiddles with date --set (or ntp or ...),
even if that would be done on purpose, this field would not care.
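
For illustration, reading that field and using it to decide whether a
pid still refers to the same process could look like the sketch below.
It assumes Linux and the /proc/<pid>/stat layout from proc(5); the
function names proc_starttime and is_same_process are made up here.

```shell
# Print the start time of pid $1: field 22 of /proc/<pid>/stat,
# in clock ticks since boot. Linux-specific.
proc_starttime()
{
    local stat
    stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    # Field 2, "(comm)", may contain blanks and ")". Strip everything
    # up to the last ") " so the remainder starts at field 3 (state).
    stat=${stat##*") "}
    set -- $stat
    # Field 22 of the full line is now positional parameter 20.
    [ $# -ge 20 ] || return 1
    shift 19
    echo "$1"
}

# Succeed only if pid $1 exists and still has the recorded start
# time $2, i.e. it is the process we recorded and not a recycled pid.
is_same_process()
{
    local now
    now=$(proc_starttime "$1") || return 1
    [ "$now" = "$2" ]
}
```

A pidfile writer would store "$pid $(proc_starttime $pid)", and anything
that later reads the pidfile would call is_same_process with both values
before signalling.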

Anyway: making this "somebody else's problem"
by using (a tool like) start-stop-daemon,
requiring it to be present,
and helping to make it do the best thing possible,
as portably as possible, could be a good way to go.

Still the "cd /proc/$pid", then work from there
would avoid various race conditions nicely.
Where available, open() then openat() will do nicely, as well,
no need to chdir.

So the "quick fix" for the issue that triggered the discussion
(not noticing that a pid has died) is my first suggestion:
# wait for pid to die:
- while kill -0 $pid; do sleep 1; done
+ ( if cd /proc/$pid ; then while test -d . ; do sleep 1; done ; fi ) > /dev/null 2>&1


Should we do an ocf-shellfuncs helper for this?
Suggested names?
What should go in there -- only the waiting, the kill TERM?
A timeout? Escalation to kill KILL?
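
One possible shape for such a helper, as a sketch only. The name
kill_pid_and_wait, the default of 30 seconds before escalating, and
the 5 second grace period after KILL are all assumptions made for
illustration; none of this exists in ocf-shellfuncs. Timing is
approximate, since it simply counts one-second sleeps.

```shell
# Usage: kill_pid_and_wait <pid> [<seconds before escalating to KILL>]
# Returns 0 once the process is gone (or was not there to begin with),
# 1 if it is still around 5 seconds after the KILL.
kill_pid_and_wait()
{
    local pid="$1" term_timeout="${2:-30}"
    (
        # Pin this exact process: once it is gone, "." stops
        # resolving, even if the pid number gets recycled.
        cd "/proc/$pid" 2>/dev/null || exit 0
        kill -TERM "$pid" 2>/dev/null
        i=0
        while test -d . ; do
            [ "$i" -eq "$term_timeout" ] && kill -KILL "$pid" 2>/dev/null
            [ "$i" -ge $((term_timeout + 5)) ] && exit 1
            sleep 1
            i=$((i + 1))
        done
        exit 0
    )
}
```

Because the kill happens only after the cd succeeded, the signal and
the wait both refer to the process that owned the pid at that moment.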

Lars

Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On Wed, Oct 22, 2014 at 02:11:21PM +0100, Tim Small wrote:
> On 22/10/14 13:50, Alan Robertson wrote:
> > Does anyone know which OSes have either or both of those /proc names?
>
> Once again, can I recommend taking a look at the start-stop-daemon
> source (see earlier posting), which does this stuff, and includes checks
> for Linux/Hurd/Sun/OpenBSD/FreeBSD/NetBSD/DragonFly, and whilst I've
> only ever used it on Linux, at the very least the BSD side seems to be
> maintained:
>
> http://anonscm.debian.org/cgit/dpkg/dpkg.git/tree/utils/start-stop-daemon.c

Does not solve the problem I was talking about at all.

If you only have a pid, it has the exact same problem,
it may miss the "pid dead" event because of pid recycling.

If you are more specific (user, parent pid, exe name...)
it becomes less and less likely -- but it would still be possible.

So you may want to add a similar trick to
pid_fd = open(/proc/pid);
.... fstat(pid_fd) ...


Yet another crazy idea, at least for linux:
do not poll, monitor! --> CONFIG_PROC_EVENTS,
subscribe to CN_IDX_PROC, wait for PROC_EVENT_EXIT

;-)

No, seriously, that's too much trouble to
replace a shell oneliner...
If however that would be added to start-stop-daemon,
it could drop the funky "algorithm" for the kill -0 polling.
(which still would need to be supported, because of old kernels)


Lars

Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On 2014-10-23T20:36:38, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> If we want to require presence of start-stop-daemon,
> we could make all this somebody elses problem.
> I need find some time to browse through the code
> to see if it can be improved further.
> But in any case, using (a tool like) start-stop-daemon consistently
> throughout all RAs would improve the situation already.
>
> Do we want to do that?
> Dejan? David? Anyone?

I'm showing my age, but Linux FailSafe had such a tool as well. ;-) So
that might make sense.

Though in Linux nowadays, I wonder if one might not directly want to add
container support to the LRM, or directly use systemd. With a container,
all processes that the RA started would be easily tracked.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

Re: RFC: pidfile handling; current worst case: stop failure and node level fencing [ In reply to ]
On Thu, Oct 23, 2014 at 09:14:32PM +0200, Lars Ellenberg wrote:
> On Wed, Oct 22, 2014 at 03:09:12PM +0200, Dejan Muhamedagic wrote:
> > On Wed, Oct 22, 2014 at 06:50:37AM -0600, Alan Robertson wrote:
> > > On 10/22/2014 03:33 AM, Dejan Muhamedagic wrote:
> > > > Hi Alan,
> > > >
> > > > On Mon, Oct 20, 2014 at 02:52:13PM -0600, Alan Robertson wrote:
> > > >> For the Assimilation code I use the full pathname of the binary from
> > > >> /proc to tell if it's "one of mine". That's not perfect if you're using
> > > >> an interpreted language. It works quite well for compiled languages.
> > > > Yes, though not perfect, that may be good enough. I supposed that
> > > > the probability that the very same program gets the same recycled
> > > > pid is rather low. (Or is it?)
> > > From my 'C' code I could touch the lock file to match the timestamp of
> > > the /proc/pid/stat (or /proc/pid/exe) symlink -- and verify that they
> > > match. If there is no /proc/pid/stat, then you won't get that extra
> > > safeguard. But as you suggest, it decreases the probability by orders
> > > of magnitude even without the
> > >
> > > The /proc/pid/exe symlink appears to have the same timestamp as
> > > /proc/pid/stat
> >
> > Hmm, not here:
> >
> > $ sudo ls -lt /proc/1
> > ...
> > lrwxrwxrwx 1 root root 0 Aug 27 13:51 exe -> /sbin/init
> > dr-x------ 2 root root 0 Aug 27 13:51 fd
> > -r--r--r-- 1 root root 0 Aug 27 13:20 cmdline
> > -r--r--r-- 1 root root 0 Aug 27 13:18 stat
>
>
> We can not rely on properties of the inodes in /proc/.
>
> These inodes get dropped and recreated as the system sees fit.
> and their properties re-initialized to "something".
> Ok, the uid/gid is consistent, obviously.
> But neither inode numbers or a,m,ctime is "stable".
>
> I demo'ed that in my first email,
> I demo it again here:
>
> sleep 120 & k=$! ; stat /proc/$k ; echo 3 > /proc/sys/vm/drop_caches ; sleep 2; find /proc/ -ls &> /dev/null; stat /proc/$k
>
> File: `/proc/8862'
> Size: 0 Blocks: 0 IO Block: 1024 directory
> Device: 3h/3d Inode: 4295899 Links: 8
> Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2014-10-23 18:43:25.535000006 +0200
> Modify: 2014-10-23 18:43:25.535000006 +0200
> Change: 2014-10-23 18:43:25.535000006 +0200
>
> File: `/proc/8862'
> Size: 0 Blocks: 0 IO Block: 1024 directory
> Device: 3h/3d Inode: 4296016 Links: 8
> Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2014-10-23 18:43:27.561002753 +0200
> Modify: 2014-10-23 18:43:27.561002753 +0200
> Change: 2014-10-23 18:43:27.561002753 +0200
>
>
> Note how the inode number and a,m,ctime changes.
>
> the "starttime" I was talking about is the 22nd field of /proc/$pid/stat
> see proc(5):
> starttime %llu (was %lu before Linux 2.6)
> (22) The time the process started after system boot.
> In kernels before Linux 2.6, this value was expressed in jiffies.
> Since Linux 2.6, the value is expressed in clock ticks
> (divide by sysconf(_SC_CLK_TCK)).
>
> Thats a monotonic time counting from system boot.
> Which makes it so attractive.
> Even if someone fiddles with date --set (or ntp or ...),
> even if that would be done on purpose, this field would not care.
>
> Anyways: making this "somebody elses problem",
> using (a tool like) start-stop-daemon,
> require that to be present,
> and help make that do the best thing possible,
> and as portable as possible, could be a good way to go.
>
> Still the "cd /proc/$pid", then work from there
> would avoid various race conditions nicely.
> Where available, open() then openat() will do nicely, as well,
> no need to chdir.
>
> So the "quick fix" to solve the issue that triggered the discussion
> (not noticing that a pid has died):
> is my first suggestion:
> # wait for pid to die:
> - while kill -0 $pid; do sleep 1; done
> + ( if cd /proc/$pid ; then while test -d . ; do sleep 1; done ; fi ) &> /dev/null
>
>
> Should we do an ocf-shellfuncs helper for this?

Yes.

> Suggested names?

There's already a function which is a naive implementation:

ocf_stop_processes

We could modify that one.

> What should go in there -- only the waiting, the kill TERM?
> A timeout? Escalation to kill KILL?

There's already some interface, I suppose we can keep it.
It accepts the list of processes and does

kill ... $pids

Not sure how to handle that. Run all of them in background in a
loop and then wait(1) for them?
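
A sketch of that "background waiters, then wait" idea, assuming Linux
/proc. The function name stop_processes_and_wait and the fixed 10
second timeout are made up for illustration; this is not the existing
ocf_stop_processes interface.

```shell
# Signal every pid given as argument, then wait for all of them in
# parallel. Returns 0 only if every process went away in time.
stop_processes_and_wait()
{
    local timeout=10 waiters="" pid w rc=0
    for pid in "$@"; do
        (
            # one waiter per pid, pinned to /proc/<pid> before the kill
            cd "/proc/$pid" 2>/dev/null || exit 0
            kill -TERM "$pid" 2>/dev/null
            i=0
            while test -d . ; do
                [ "$i" -ge "$timeout" ] && exit 1
                sleep 1
                i=$((i + 1))
            done
            exit 0
        ) &
        waiters="$waiters $!"
    done
    # collect the exit status of every waiter
    for w in $waiters; do
        wait "$w" || rc=1
    done
    return $rc
}
```

The total wait is bounded by the slowest process rather than by the
sum over all pids.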

Cheers,

Dejan

> Lars
>
Re: RFC: pidfile handling; current worst case: stop failure and node level fencing
On Thu, Oct 23, 2014 at 08:36:38PM +0200, Lars Ellenberg wrote:
> On Tue, Oct 21, 2014 at 02:06:24PM +0100, Tim Small wrote:
> > On 20/10/14 20:17, Lars Ellenberg wrote:
> > > In other OSes, ps may be able to give a good enough equivalent?
> >
> > Debian's start-stop-daemon executable might be worth considering here -
> > it's used extensively in the init script infrastructure of Debian (and
> > derivatives, over several different OS kernels), and so is well
> > debugged, and in my experience beats re-implementing its functionality.
> >
> > http://anonscm.debian.org/cgit/dpkg/dpkg.git/tree/utils/start-stop-daemon.c
> >
> > I've used it in pacemaker resource control scripts before successfully -
> > its kill expression support is very useful in particular on HA.
> >
> > Tim.
> >
> >
> > NAME
> >
> > start-stop-daemon - start and stop system daemon programs
>
> Really? pasting a man page to a mailing list?
>
> But yes...
>
> If we want to require presence of start-stop-daemon,
> we could make all this somebody else's problem.
> I need to find some time to browse through the code
> to see if it can be improved further.
> But in any case, using (a tool like) start-stop-daemon consistently
> throughout all RAs would improve the situation already.
>
> Do we want to do that?
> Dejan? David? Anyone?

I think I'm happy with a one-liner shell solution.

Cheers,

Dejan

>
> Lars
Re: RFC: pidfile handling; current worst case: stop failure and node level fencing
On 10/24/2014 03:32 AM, Lars Marowsky-Bree wrote:
> On 2014-10-23T20:36:38, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
>
>> If we want to require presence of start-stop-daemon,
>> we could make all this somebody else's problem.
>> I need to find some time to browse through the code
>> to see if it can be improved further.
>> But in any case, using (a tool like) start-stop-daemon consistently
>> throughout all RAs would improve the situation already.
>>
>> Do we want to do that?
>> Dejan? David? Anyone?
> I'm showing my age, but Linux FailSafe had such a tool as well. ;-) So
> that might make sense.
>
> Though in Linux nowadays, I wonder if one might not directly want to add
> container support to the LRM, or directly use systemd. With a container,
> all processes that the RA started would be easily tracked.

Process groups do that nicely. The LRM (at least used to) put
everything in a process group.