Mailing List Archive

[RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

CHANGE HISTORY OF THIS PATCH
----------------------------
Version 2
---------
* Add cumulative dead thread CPU times to taskstats
* Introduce parallel accounting process hierarchy (based on discussions
with Oleg Nesterov and Roland McGrath)

DESCRIPTION
-----------
Currently the cumulative time accounting in Linux has two major drawbacks
for tools like the top command:

* Due to POSIX.1-2001, the CPU time of processes is not accounted
to the cumulative time of their parents if the parents ignore SIGCHLD
or have set SA_NOCLDWAIT. The major drawback of this behaviour is that
it is not possible to calculate all CPU time consumed on a system by
looking only at the current tasks; CPU time can be lost.

* When a parent process dies, its children are reparented to the init
process. For accounting this is suboptimal, because init then gets the
CPU time of those tasks. It would be better if the CPU time were passed
along the relationship tree using the cumulative time counters, as would
have happened if the child had died before the parent. The main benefit
of that approach is that between two task snapshots it is always clear
which parent received the CPU time of dead tasks. Otherwise there are
situations where we cannot determine whether init or an older relative
of a dead task has inherited the CPU time.

Another benefit is that it would then be possible, for example, to look
at the cumulative times of a login shell process to get all CPU time that
has been consumed by its children, grandchildren, etc. This would allow
accounting without needing exit events for all dead processes.
Note that this has a small flaw, because processes can use
daemonize to detach themselves from their parents.

This patch adds a new set of cumulative time counters. We then have two
cumulative counter sets:

* cdata_wait: Traditional cumulative time used e.g. by getrusage.
* cdata_acct: Cumulative time that also includes dead processes with
parents that ignore SIGCHLD or have set SA_NOCLDWAIT.
cdata_acct will be exported by taskstats.

Besides that, the patch adds an "acct_parent" pointer next to the parent
pointer and an "acct_children" list next to the children list to the
signal_struct in order to remember the accounting process relationship.

With this patch and the following time fields it is now possible to
calculate all CPU time consumed on a system in an interval by looking at
the tasks of two taskstats snapshots (a sketch follows the list):

* task->(u/s/st/g-time):
Time that has been consumed by task itself

* task->signal->cdata_acct.(c-u/s/st/g-time):
All time that has been consumed by dead children of the process. This
also includes time from processes that reaped themselves, because the
parent ignored SIGCHLD or had set SA_NOCLDWAIT

* task->signal->(u/s/st/g-time):
Time that has been consumed by dead threads of the process's thread group
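
To make the arithmetic concrete, here is a minimal userspace sketch of that
calculation. This is a sketch only: it assumes per-process taskstats records
for thread group leaders; ac_utime/ac_stime already exist in taskstats,
ac_sttime comes from an earlier patch of this series, and the remaining
fields are added by this patch (guest time is left out for brevity):

#include <linux/taskstats.h>

static __u64 task_total_time(const struct taskstats *ts)
{
	/* time consumed by the task itself */
	__u64 total = ts->ac_utime + ts->ac_stime + ts->ac_sttime;

	/* cumulative time of dead children (cdata_acct) */
	total += ts->ac_cutime + ts->ac_cstime + ts->ac_csttime;

	/* time of dead threads in the thread group */
	total += ts->ac_tutime + ts->ac_tstime + ts->ac_tsttime;

	return total;
}

The CPU time consumed in an interval is then task_total_time() of snapshot 2
minus task_total_time() of snapshot 1, summed over all tasks of snapshot 2.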

Having this is a prerequisite for the following use cases:

USE CASE I: A top command that shows exactly 100% of all consumed CPU time
between two task snapshots without using task exit events. Exit events are not
necessary, because if tasks die between the two snapshots, all their time can
be found in the cumulative counters of the parent processes or thread group
leaders.

To provide a consistent view, the top tool could show the following fields
(all values are per interval):
* user: task utime
* sys: task stime
* ste: task sttime
* cuser: utime of exited children
* csys: stime of exited children
* cste: sttime of exited children
* xuser: utime of exited threads in thread group
* xsys: stime of exited threads in thread group
* xste: sttime of exited threads in thread group
* total: Sum of all above fields

If the top command notices that a pid disappeared between snapshot 1
and snapshot 2, it has to find its accounting parent and subtract the dead
child's snapshot 1 CPU times from the parent's cumulative times.
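
A rough sketch of that correction (hypothetical helper, reusing the
task_total_time() sketch from above; snapshots are assumed to be keyed by pid):

static __u64 parent_interval_time(const struct taskstats *parent_snap2,
				  const struct taskstats *parent_snap1,
				  const struct taskstats *dead_child_snap1)
{
	/* cumulative delta of the accounting parent between the snapshots */
	__u64 delta = task_total_time(parent_snap2) -
		      task_total_time(parent_snap1);

	/* the child's consumption up to snapshot 1 was already shown there */
	return delta - task_total_time(dead_child_snap1);
}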

Example:
--------
                          CPU time of dead tasks in last interval
                          VVVVV  VVVVV  VVVV
  pid   user   sys   ste  cuser   csys  cste  xuser  xsys  xste  total  Name
  (#)    (%)   (%)   (%)    (%)    (%)   (%)    (%)   (%)   (%)    (%)  (str)
17944   0.10  0.01  0.00  54.29  14.36  0.22    0.0   0.0   0.0  68.98  make
18006   0.10  0.01  0.00  55.79  12.23  0.12    0.0   0.0   0.0  68.26  make
18041  48.18  1.51  0.29   0.00   0.00  0.00    0.0   0.0   0.0  49.98  cc1
...

The sum of all "total" CPU counters on a system that is 100% busy should
be exactly the number of CPUs multiplied by the interval time. A good test case
for this is to start a loop program for each CPU and then, in parallel, start
a kernel build with "-j 5".
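
For reference, a trivial loop program for such a test could look like this
(one instance started per CPU):

/* loop.c - keep one CPU 100% busy for the test described above */
int main(void)
{
	volatile unsigned long i = 0;

	for (;;)
		i++;
	return 0;
}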

USE CASE II: Implement a precise replacement for the "time" command that gets
all consumed CPU time of the children tree without using exit events.

Example:
--------
# cd linux-2.6
# tstime make -j 3
Cumulative CPU times:
---------------------
user..: 1000266540
system: 121721391
steal.: 3480230

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 fs/binfmt_elf.c           |    4 -
 fs/proc/array.c           |   10 +-
 fs/proc/base.c            |    3
 include/linux/init_task.h |    2
 include/linux/sched.h     |   39 +++++++---
 include/linux/taskstats.h |    6 +
 kernel/exit.c             |  175 ++++++++++++++++++++++++++++------------------
 kernel/fork.c             |    7 +
 kernel/sys.c              |   24 +++---
 kernel/tsacct.c           |   33 ++++++++
 10 files changed, 205 insertions(+), 98 deletions(-)

--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1296,8 +1296,8 @@ static void fill_prstatus(struct elf_prs
cputime_to_timeval(p->utime, &prstatus->pr_utime);
cputime_to_timeval(p->stime, &prstatus->pr_stime);
}
- cputime_to_timeval(p->signal->cutime, &prstatus->pr_cutime);
- cputime_to_timeval(p->signal->cstime, &prstatus->pr_cstime);
+ cputime_to_timeval(p->signal->cdata_wait.cutime, &prstatus->pr_cutime);
+ cputime_to_timeval(p->signal->cdata_wait.cstime, &prstatus->pr_cstime);
}

static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -413,11 +413,11 @@ static int do_task_stat(struct seq_file
num_threads = get_nr_threads(task);
collect_sigign_sigcatch(task, &sigign, &sigcatch);

- cmin_flt = sig->cmin_flt;
- cmaj_flt = sig->cmaj_flt;
- cutime = sig->cutime;
- cstime = sig->cstime;
- cgtime = sig->cgtime;
+ cmin_flt = sig->cdata_wait.cmin_flt;
+ cmaj_flt = sig->cdata_wait.cmaj_flt;
+ cutime = sig->cdata_wait.cutime;
+ cstime = sig->cdata_wait.cstime;
+ cgtime = sig->cdata_wait.cgtime;
rsslim = ACCESS_ONCE(sig->rlim[RLIMIT_RSS].rlim_cur);

/* add up live thread stats at the group level */
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2671,7 +2671,8 @@ static int do_io_accounting(struct task_
if (whole && lock_task_sighand(task, &flags)) {
struct task_struct *t = task;

- task_io_accounting_add(&acct, &task->signal->ioac);
+ task_io_accounting_add(&acct,
+ &task->signal->cdata_wait.ioac);
while_each_thread(task, t)
task_io_accounting_add(&acct, &t->ioac);

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -31,6 +31,8 @@ extern struct fs_struct init_fs;
}, \
.cred_guard_mutex = \
__MUTEX_INITIALIZER(sig.cred_guard_mutex), \
+ .acct_parent = NULL, \
+ .acct_children = LIST_HEAD_INIT(sig.acct_children), \
}

extern struct nsproxy init_nsproxy;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -512,6 +512,20 @@ struct thread_group_cputimer {
};

/*
+ * Cumulative resource counters for reaped dead child processes.
+ * Live threads maintain their own counters and add to these
+ * in __exit_signal, except for the group leader.
+ */
+struct cdata {
+ cputime_t cutime, cstime, csttime, cgtime;
+ unsigned long cnvcsw, cnivcsw;
+ unsigned long cmin_flt, cmaj_flt;
+ unsigned long cinblock, coublock;
+ unsigned long cmaxrss;
+ struct task_io_accounting ioac;
+};
+
+/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
* implies a shared sighand_struct, so locking
@@ -578,23 +592,28 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+ /* Cumulative resource counters for all dead child processes */
+ struct cdata cdata_wait; /* parents have done sys_wait() */
+ struct cdata cdata_acct; /* complete cumulative data from acct tree */
+
+ /* Parallel accounting tree */
+ struct signal_struct *acct_parent; /* accounting parent process */
+ struct list_head acct_children; /* list of my accounting children */
+ struct list_head acct_sibling; /* linkage in my parent's list */
+
/*
* Cumulative resource counters for dead threads in the group,
- * and for reaped dead child processes forked by this group.
- * Live threads maintain their own counters and add to these
- * in __exit_signal, except for the group leader.
*/
- cputime_t utime, stime, sttime, cutime, cstime, csttime;
+ cputime_t utime, stime, sttime;
+
cputime_t gtime;
- cputime_t cgtime;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
cputime_t prev_utime, prev_stime, prev_sttime;
#endif
- unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
- unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
- unsigned long inblock, oublock, cinblock, coublock;
- unsigned long maxrss, cmaxrss;
- struct task_io_accounting ioac;
+ unsigned long nvcsw, nivcsw;
+ unsigned long min_flt, maj_flt;
+ unsigned long inblock, oublock;
+ unsigned long maxrss;

/*
* Cumulative ns of schedule CPU time fo dead threads in the
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -169,6 +169,12 @@ struct taskstats {
__u64 time_ns;
__u32 ac_tgid; /* Thread group ID */
__u64 ac_sttime; /* Steal CPU time [usec] */
+ __u64 ac_cutime; /* User CPU time of children [usec] */
+ __u64 ac_cstime; /* System CPU time of children [usec] */
+ __u64 ac_csttime; /* Steal CPU time of children [usec] */
+ __u64 ac_tutime; /* User CPU time of threads [usec] */
+ __u64 ac_tstime; /* System CPU time of threads [usec] */
+ __u64 ac_tsttime; /* Steal CPU time of threads [usec] */
};


--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -51,6 +51,7 @@
#include <trace/events/sched.h>
#include <linux/hw_breakpoint.h>
#include <linux/oom.h>
+#include <linux/kernel_stat.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -74,6 +75,85 @@ static void __unhash_process(struct task
list_del_rcu(&p->thread_group);
}

+static void __account_ctime(struct task_struct *p, struct task_struct *parent,
+ struct cdata *pcd, struct cdata *ccd)
+{
+ struct signal_struct *sig = p->signal;
+ cputime_t tgutime, tgstime, tgsttime;
+ unsigned long maxrss, flags;
+
+ /*
+ * The resource counters for the group leader are in its
+ * own task_struct. Those for dead threads in the group
+ * are in its signal_struct, as are those for the child
+ * processes it has previously reaped. All these
+ * accumulate in the parent's signal_struct c* fields.
+ *
+ * We don't bother to take a lock here to protect these
+ * p->signal fields, because they are only touched by
+ * __exit_signal, which runs with tasklist_lock
+ * write-locked anyway, and so is excluded here. We do
+ * need to protect the access to parent->signal fields,
+ * as other threads in the parent group can be right
+ * here reaping other children at the same time.
+ *
+ * We use thread_group_times() to get times for the thread
+ * group, which consolidates times for all threads in the
+ * group including the group leader.
+ */
+ thread_group_times(p, &tgutime, &tgstime, &tgsttime);
+
+ spin_lock_irqsave(&parent->sighand->siglock, flags);
+ pcd->cutime = cputime_add(pcd->cutime,
+ cputime_add(tgutime, ccd->cutime));
+ pcd->cstime = cputime_add(pcd->cstime,
+ cputime_add(tgstime, ccd->cstime));
+ pcd->csttime = cputime_add(pcd->csttime,
+ cputime_add(tgsttime, ccd->csttime));
+ pcd->cgtime = cputime_add(pcd->cgtime, cputime_add(p->gtime,
+ cputime_add(sig->gtime, ccd->cgtime)));
+
+ pcd->cmin_flt += p->min_flt + sig->min_flt + ccd->cmin_flt;
+ pcd->cmaj_flt += p->maj_flt + sig->maj_flt + ccd->cmaj_flt;
+ pcd->cnvcsw += p->nvcsw + sig->nvcsw + ccd->cnvcsw;
+ pcd->cnivcsw += p->nivcsw + sig->nivcsw + ccd->cnivcsw;
+ pcd->cinblock += task_io_get_inblock(p) + sig->inblock + ccd->cinblock;
+ pcd->coublock += task_io_get_oublock(p) + sig->oublock + ccd->coublock;
+ maxrss = max(sig->maxrss, ccd->cmaxrss);
+ if (pcd->cmaxrss < maxrss)
+ pcd->cmaxrss = maxrss;
+
+ task_io_accounting_add(&pcd->ioac, &p->ioac);
+ task_io_accounting_add(&pcd->ioac, &ccd->ioac);
+ spin_unlock_irqrestore(&parent->sighand->siglock, flags);
+}
+
+static void reparent_acct(struct signal_struct *father)
+{
+ struct signal_struct *c, *n, *new_parent;
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = pid_task(father->leader_pid, PIDTYPE_PID);
+ if (!list_empty(&father->acct_children))
+ printk("XXX forget: %i (%s)\n", tsk->pid, tsk->comm);
+ new_parent = father->acct_parent;
+ list_for_each_entry_safe(c, n, &father->acct_children, acct_sibling) {
+ struct task_struct *ctsk = pid_task(c->leader_pid, PIDTYPE_PID);
+ struct task_struct *ptsk = pid_task(new_parent->leader_pid, PIDTYPE_PID);
+ printk(" XXX move: %i (%s) to %i (%s) %p\n",
+ ctsk->pid, ctsk->comm, ptsk->pid, ptsk->comm, new_parent);
+ c->acct_parent = new_parent;
+ list_move_tail(&c->acct_sibling, &new_parent->acct_children);
+ }
+ rcu_read_unlock();
+}
+
/*
* This function expects the tasklist_lock write-locked.
*/
@@ -84,6 +164,29 @@ static void __exit_signal(struct task_st
struct sighand_struct *sighand;
struct tty_struct *uninitialized_var(tty);

+#ifdef __s390x__
+ /*
+ * FIXME: On s390 we can call account_process_tick to update
+ * CPU time information. This is probably not valid on other
+ * architectures.
+ */
+ if (current == tsk)
+ account_process_tick(current, 1);
+#endif
+ if (group_dead) {
+ struct task_struct *acct_parent =
+ pid_task(sig->acct_parent->leader_pid, PIDTYPE_PID);
+ /*
+ * FIXME: This somehow has to be moved to
+ * finish_task_switch(), because otherwise
+ * if the process accounts itself, the CPU time
+ * that is used for this code will be lost.
+ */
+ __account_ctime(tsk, acct_parent, &sig->acct_parent->cdata_acct,
+ &sig->cdata_acct);
+ reparent_acct(sig);
+ list_del_init(&sig->acct_sibling);
+ }
sighand = rcu_dereference_check(tsk->sighand,
rcu_read_lock_held() ||
lockdep_tasklist_lock_is_held());
@@ -132,7 +235,8 @@ static void __exit_signal(struct task_st
sig->nivcsw += tsk->nivcsw;
sig->inblock += task_io_get_inblock(tsk);
sig->oublock += task_io_get_oublock(tsk);
- task_io_accounting_add(&sig->ioac, &tsk->ioac);
+ task_io_accounting_add(&sig->cdata_wait.ioac,
+ &tsk->ioac);
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
}

@@ -1226,73 +1330,10 @@ static int wait_task_zombie(struct wait_
* !task_detached() to filter out sub-threads.
*/
if (likely(!traced) && likely(!task_detached(p))) {
- struct signal_struct *psig;
- struct signal_struct *sig;
- unsigned long maxrss;
- cputime_t tgutime, tgstime, tgsttime;
-
- /*
- * The resource counters for the group leader are in its
- * own task_struct. Those for dead threads in the group
- * are in its signal_struct, as are those for the child
- * processes it has previously reaped. All these
- * accumulate in the parent's signal_struct c* fields.
- *
- * We don't bother to take a lock here to protect these
- * p->signal fields, because they are only touched by
- * __exit_signal, which runs with tasklist_lock
- * write-locked anyway, and so is excluded here. We do
- * need to protect the access to parent->signal fields,
- * as other threads in the parent group can be right
- * here reaping other children at the same time.
- *
- * We use thread_group_times() to get times for the thread
- * group, which consolidates times for all threads in the
- * group including the group leader.
- */
- thread_group_times(p, &tgutime, &tgstime, &tgsttime);
- spin_lock_irq(&p->real_parent->sighand->siglock);
- psig = p->real_parent->signal;
- sig = p->signal;
- psig->cutime =
- cputime_add(psig->cutime,
- cputime_add(tgutime,
- sig->cutime));
- psig->cstime =
- cputime_add(psig->cstime,
- cputime_add(tgstime,
- sig->cstime));
- psig->csttime =
- cputime_add(psig->csttime,
- cputime_add(tgsttime,
- sig->csttime));
- psig->cgtime =
- cputime_add(psig->cgtime,
- cputime_add(p->gtime,
- cputime_add(sig->gtime,
- sig->cgtime)));
- psig->cmin_flt +=
- p->min_flt + sig->min_flt + sig->cmin_flt;
- psig->cmaj_flt +=
- p->maj_flt + sig->maj_flt + sig->cmaj_flt;
- psig->cnvcsw +=
- p->nvcsw + sig->nvcsw + sig->cnvcsw;
- psig->cnivcsw +=
- p->nivcsw + sig->nivcsw + sig->cnivcsw;
- psig->cinblock +=
- task_io_get_inblock(p) +
- sig->inblock + sig->cinblock;
- psig->coublock +=
- task_io_get_oublock(p) +
- sig->oublock + sig->coublock;
- maxrss = max(sig->maxrss, sig->cmaxrss);
- if (psig->cmaxrss < maxrss)
- psig->cmaxrss = maxrss;
- task_io_accounting_add(&psig->ioac, &p->ioac);
- task_io_accounting_add(&psig->ioac, &sig->ioac);
- spin_unlock_irq(&p->real_parent->sighand->siglock);
+ __account_ctime(p, p->real_parent,
+ &p->real_parent->signal->cdata_wait,
+ &p->signal->cdata_wait);
}
-
/*
* Now we are sure this task is interesting, and no other
* thread can reap it because we set its state to EXIT_DEAD.
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1289,6 +1289,13 @@ static struct task_struct *copy_process(
nr_threads++;
}

+ if (!(clone_flags & CLONE_THREAD)) {
+ p->signal->acct_parent = current->signal;
+ INIT_LIST_HEAD(&p->signal->acct_children);
+ INIT_LIST_HEAD(&p->signal->acct_sibling);
+ list_add_tail(&p->signal->acct_sibling,
+ &current->signal->acct_children);
+ }
total_forks++;
spin_unlock(&current->sighand->siglock);
write_unlock_irq(&tasklist_lock);
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -884,8 +884,8 @@ void do_sys_times(struct tms *tms)

spin_lock_irq(&current->sighand->siglock);
thread_group_times(current, &tgutime, &tgstime, &tgsttime);
- cutime = current->signal->cutime;
- cstime = current->signal->cstime;
+ cutime = current->signal->cdata_wait.cutime;
+ cstime = current->signal->cdata_wait.cstime;
spin_unlock_irq(&current->sighand->siglock);
tms->tms_utime = cputime_to_clock_t(tgutime);
tms->tms_stime = cputime_to_clock_t(tgstime);
@@ -1490,6 +1490,7 @@ static void k_getrusage(struct task_stru
unsigned long flags;
cputime_t tgutime, tgstime, tgsttime, utime, stime, sttime;
unsigned long maxrss = 0;
+ struct cdata *cd;

memset((char *) r, 0, sizeof *r);
utime = stime = cputime_zero;
@@ -1507,15 +1508,16 @@ static void k_getrusage(struct task_stru
switch (who) {
case RUSAGE_BOTH:
case RUSAGE_CHILDREN:
- utime = p->signal->cutime;
- stime = p->signal->cstime;
- r->ru_nvcsw = p->signal->cnvcsw;
- r->ru_nivcsw = p->signal->cnivcsw;
- r->ru_minflt = p->signal->cmin_flt;
- r->ru_majflt = p->signal->cmaj_flt;
- r->ru_inblock = p->signal->cinblock;
- r->ru_oublock = p->signal->coublock;
- maxrss = p->signal->cmaxrss;
+ cd = &p->signal->cdata_wait;
+ utime = cd->cutime;
+ stime = cd->cstime;
+ r->ru_nvcsw = cd->cnvcsw;
+ r->ru_nivcsw = cd->cnivcsw;
+ r->ru_minflt = cd->cmin_flt;
+ r->ru_majflt = cd->cmaj_flt;
+ r->ru_inblock = cd->cinblock;
+ r->ru_oublock = cd->coublock;
+ maxrss = cd->cmaxrss;

if (who == RUSAGE_CHILDREN)
break;
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -61,8 +61,37 @@ void bacct_add_tsk(struct taskstats *sta
tcred = __task_cred(tsk);
stats->ac_uid = tcred->uid;
stats->ac_gid = tcred->gid;
- stats->ac_ppid = pid_alive(tsk) ?
- rcu_dereference(tsk->real_parent)->tgid : 0;
+ /*
+ * FIXME: Add new field "ac_acct_ppid" and "ac_acct_ctime" to
+ * taskstats
+ */
+ if (pid_alive(tsk) && tsk->signal) {
+ struct signal_struct *acct_parent = tsk->signal->acct_parent;
+ if (acct_parent && acct_parent->leader_pid)
+ stats->ac_ppid = pid_task(acct_parent->leader_pid,
+ PIDTYPE_PID)->tgid;
+ else
+ stats->ac_ppid = 0;
+ } else {
+ stats->ac_ppid = 0;
+ }
+ if (tsk->signal && tsk->tgid == tsk->pid) {
+ struct cdata *cd = &tsk->signal->cdata_acct;
+
+ stats->ac_cutime = cputime_to_usecs(cd->cutime);
+ stats->ac_cstime = cputime_to_usecs(cd->cstime);
+ stats->ac_csttime = cputime_to_usecs(cd->csttime);
+ stats->ac_tutime = cputime_to_usecs(tsk->signal->utime);
+ stats->ac_tstime = cputime_to_usecs(tsk->signal->stime);
+ stats->ac_tsttime = cputime_to_usecs(tsk->signal->sttime);
+ } else {
+ stats->ac_cutime = 0;
+ stats->ac_cstime = 0;
+ stats->ac_csttime = 0;
+ stats->ac_tutime = 0;
+ stats->ac_tstime = 0;
+ stats->ac_tsttime = 0;
+ }
rcu_read_unlock();
stats->ac_utime = cputime_to_usecs(tsk->utime);
stats->ac_stime = cputime_to_usecs(tsk->stime);

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
First of all, let me repeat, I am not going to discuss whether we need
these changes or not, I do not know even if I understand your motivation.

My only concern is correctness, but

On 11/11, Michael Holzheu wrote:
>
> * Add cumulative dead thread CPU times to taskstats
> * Introduce parallel accounting process hierarchy (based on discussions
> with Oleg Nesterov and Roland McGrath)

Michael, I really think this patch does too many different things.
I forgot the details of our previous discussion, and I got lost
trying to understand the new version.

I already asked you to split these changes, perhaps you can do this?
Say, bacct_add_tsk() looks overcomplicated, the change in copy_process()
shouldn't introduce the new CLONE_THREAD check, not sure I understand
why release_task() was chosen for reparenting, other nits...

But it is not easy to discuss these completely different things
looking at the single patch.

Imho, it would be much better if you make a separate patch which
adds acct_parent/etc and implements the parallel hierarchy. This
also makes sense because it touches the core kernel.

Personally I think that even "struct cdata" + __account_ctime() helper
needs a separate patch, and perhaps this change makes sense by itself
as cleanup. And this way the "trivial" changes (like the changes in
k_getrusage) won't distract from the functional changes.

The final change should introduce cdata_acct and actually implement
the complete cumulative accounting.

At least, please distill parallel accounting process hierarchy.
We do not want bugs in this area ;)

What do you think?

Oleg.

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
On Sat, 13 Nov 2010 19:38:10 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> First of all, let me repeat, I am not going to discuss whether we need
> these changes or not, I do not know even if I understand your motivation.

Hmm, is the motivation of the whole patch series unclear or just the details
in this one? What we want is a low-overhead tool that precisely shows
where the cpu spent its time (or didn't because of steal time). The
granularity target is tenths of microseconds, something that should be
possible with decent hardware.
What makes life hard is the corner case of a task that has been reparented
to init. That is the reason for this complicated secondary accounting structure.

> My only concern is correctness, but

Your review effort to improve correctness is very much appreciated. We'll
try to split the patches to make review easier.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
On Mon, 2010-11-15 at 16:55 +0100, Martin Schwidefsky wrote:
> What we want is a low-overhead tool that precisely shows
> where the cpu spent its time (or didn't because of steal time). The
> granularity target is tenths of microseconds, something that should be
> possible with decent hardware.

To what purpose?

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
On Mon, 15 Nov 2010 17:03:56 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2010-11-15 at 16:55 +0100, Martin Schwidefsky wrote:
> > What we want is a low-overhead tool that precisely shows
> > where the cpu spent its time (or didn't because of steal time). The
> > granularity target is tenths of microseconds, something that should be
> > possible with decent hardware.
>
> To what purpose?

Is that a trick question? Why do we have tools like "top"? Or process
accounting? The point is that the quality of the numbers we get right
now is rather bad, the overhead of scanning /proc is horrendous and
the 10ms granularity is rather coarse.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
On Mon, 2010-11-15 at 18:49 +0100, Martin Schwidefsky wrote:
> > To what purpose?
>
> Is that a trick question? Why do we have tools like "top"? Or process
> accounting? The point is that the quality of the numbers we get right
> now is rather bad, the overhead of scanning /proc is horrendous and
> the 10ms granularity is rather coarse.

But you're not just replacing top, you're adding all kinds of new
accounting crap all over the place.
Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
On Mon, 15 Nov 2010 18:51:57 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2010-11-15 at 18:49 +0100, Martin Schwidefsky wrote:
> > > To what purpose?
> >
> > Is that a trick question? Why do we have tools like "top"? Or process
> > accounting? The point is that the quality of the numbers we get right
> > now is rather bad, the overhead of scanning /proc is horrendous and
> > the 10ms granularity is rather coarse.
>
> But you're not just replacing top, you're adding all kinds of new
> accounting crap all over the place.

We DO replace top. Patch #7 of 7.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
On Mon, 2010-11-15 at 19:00 +0100, Martin Schwidefsky wrote:
> On Mon, 15 Nov 2010 18:51:57 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> > On Mon, 2010-11-15 at 18:49 +0100, Martin Schwidefsky wrote:
> > > > To what purpose?
> > >
> > > Is that a trick question? Why do we have tools like "top"? Or process
> > > accounting? The point is that the quality of the numbers we get right
> > > now is rather bad, the overhead of scanning /proc is horrendous and
> > > the 10ms granularity is rather coarse.
> >
> > But you're not just replacing top, you're adding all kinds of new
> > accounting crap all over the place.
>
> We DO replace top. Patch #7 of 7.

You _also_ replace top, but its not by far the only thing you do. If you
simply replaced top you wouldn't need to add tons of extra accounting,
would you?
Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
On Mon, 15 Nov 2010 19:10:06 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2010-11-15 at 19:00 +0100, Martin Schwidefsky wrote:
> > On Mon, 15 Nov 2010 18:51:57 +0100
> > Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > > On Mon, 2010-11-15 at 18:49 +0100, Martin Schwidefsky wrote:
> > > > > To what purpose?
> > > >
> > > > Is that a trick question? Why do we have tools like "top"? Or process
> > > > accounting? The point is that the quality of the numbers we get right
> > > > now is rather bad, the overhead of scanning /proc is horrendous and
> > > > the 10ms granularity is rather coarse.
> > >
> > > But you're not just replacing top, you're adding all kinds of new
> > > accounting crap all over the place.
> >
> > We DO replace top. Patch #7 of 7.
>
> You _also_ replace top, but its not by far the only thing you do. If you
> simply replaced top you wouldn't need to add tons of extra accounting,
> would you?

There are basically two things we want to accomplish:
1) Make top faster by replacing the /proc based data gathering with taskstats.
To really make it go fast with taskstats we filter with last_depart to
read only tasks that have changed since the last snapshot.
2) Make top more precise. That is where all of the extra accounting comes
into play.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
Hello Oleg,

On Sat, 2010-11-13 at 19:38 +0100, Oleg Nesterov wrote:
> First of all, let me repeat, I am not going to discuss whether we need
> these changes or not, I do not know even if I understand your motivation.

Sorry, if I was not clear enough with my descriptions. Let me try to
describe my motivation again:

The cumulative accounting patch implements an infrastructure that makes
it possible to collect all CPU time usage between two task snapshots
without using task exit events. The main idea is to show cumulative CPU
time fields for each task in the top command that contain the CPU time
of children that died in the last interval.

Example 1 (simple case):
A "make" process forked several gcc processes after snapshot 1 and all
of them exited before snapshot 2. We subtract the cumulative CPU times
of "make" of snapshot 1 from the cumulative times of "make" of snapshot
2. The result will be the consumed CPU time of the dead gcc processes in
the last interval. The value is that we can see in top that "make" is
"responsible" for this CPU time.

Example 2:
We have the gcc processes in snapshot 1 but not in snapshot 2. Then the
top command has to find the nearest relative (e.g. the parent process)
that is still alive in snapshot 2, create the delta of the cumulative
time for this process and subtract the CPU time of the gcc processes of
snapshot 1. This gives you the CPU time that was consumed by the gcc
processes between snapshot 1 and 2.

With your help we identified two problems that make this approach
impossible or at least not exact with the current Linux implementation:
1. Cumulative CPU time counters are not complete (SA_NOCLDWAIT)
2. Because of reparent to init, there are situations where it is
not clear to which tasks the CPU time of dead tasks between
two snapshots has been accounted. This is a problem for example 2.

The patch tries to solve the problem by adding a second set of
cumulative data that contains all CPU time of dead children and adds the
parallel hierarchy to make it unambiguous which parent got the CPU time
of dead tasks (needed for example 2).

I hope that you understand now the value and the motivation of fixing
the two problems.

I know that new userspace APIs should be added with care and should be
avoided when the value of a new function is not big enough. I also can
understand the objections from Peter and your concerns. So is the value
of the new function big enough?

I will answer your other technical comments in a separate mail.

Michael

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
Hello Oleg,

On Sat, 2010-11-13 at 19:38 +0100, Oleg Nesterov wrote:
> I already asked you to split these changes, perhaps you can do this?
> Say, bacct_add_tsk() looks overcomplicated, the change in copy_process()
> shouldn't introduce the new CLONE_THREAD check, not sure I understand
> why release_task() was chosen for reparenting, other nits...

I want to establish the new hierarchy when a new process is forked and
not for new threads, therefore the check for CLONE_THREAD in
copy_process(). I do the reparenting with reparent_acct() when a process
dies, therefore the check for "group_dead" in exit_signal().

> But it is not easy to discuss these completely different things
> looking at the single patch.
>
> Imho, it would be much better if you make a separate patch which
> adds acct_parent/etc and implements the parallel hierarchy. This
> also makes sense because it touches the core kernel.
>
> Personally I think that even "struct cdata" + __account_ctime() helper
> needs a separate patch, and perhaps this change makes sense by itself
> as cleanup. And this way the "trivial" changes (like the changes in
> k_getrusage) won't distract from the functional changes.
>
> The final change should introduce cdata_acct and actually implement
> the complete cumulative accounting.

So you want to have the following three patches:
* Introduce "struct cdata" + __account_ctime() (no functional change)
* Add cdata_acct accounting + parallel accounting hierarchy
* Add taskstats interface to export the data to userspace

Correct?

Michael

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
Hi Michael,

I can't read this discussion today, will try to do tomorrow and reply.
Just one note,

On 11/16, Michael Holzheu wrote:
>
> So you want to have the following three patches:
> * Introduce "struct cdata" + __account_ctime() (no functional change)
> * Add cdata_acct accounting + parallel accounting hierarchy
> * Add taskstats interface to export the data to userspace

Something like this (if possible).

But my main concern is parallel hierarchy itself. So, if you ask me,
I'd prefer to see a separate patch which only adds acct_parent/etc
and implements this reparenting, even without cdata_acct accounting.

Again, if you think this is possible. I understand that it is not
easy to maintain this series,

Oleg.

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
On 11/16, Michael Holzheu wrote:
>
> On Sat, 2010-11-13 at 19:38 +0100, Oleg Nesterov wrote:
> > I already asked you to split these changes, perhaps you can do this?
> > Say, bacct_add_tsk() looks overcomplicated, the change in copy_process()
> > shouldn't introduce the new CLONE_THREAD check, not sure I understand
> > why release_task() was chosen for reparenting, other nits...
>
> I want to establish the new hierarchy when a new process is forked and
> not for new threads, therefore the check for CLONE_THREAD in
> copy_process().

Yes, but copy_process() already checks CLONE_THREAD many times. No
need to introduce the new check.

> I do the reparenting with reparent_acct() when a process
> dies, therefore the check for "group_dead" in exit_signal().

And it is not clear to me why release_task() is better than
exit_notify().


That said, perhaps I'll understand this reading the next version.
That is why I asked to split.

Oleg.

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
Sorry for delay.

On 11/16, Michael Holzheu wrote:
>
> On Sat, 2010-11-13 at 19:38 +0100, Oleg Nesterov wrote:
> > First of all, let me repeat, I am not going to discuss whether we need
> > these changes or not, I do not know even if I understand your motivation.
>
> Sorry, if I was not clear enough with my descriptions. Let me try to
> describe my motivation again:

Yes, more or less I see what this patch does (I hope ;).

> 2. Because of reparent to init, there are situations where it is
> not clear to which tasks the CPU time of dead tasks between
> two snapshots has been accounted. This is a problem for example 2.

Yes, I see.

But I must admit, _personally_ I am not sure this problem is worth the
trouble. And, I already mentioned the daemonize() case, IOW I am not sure
it is _always_ right to choose exiting_parent->parent for accounting.
To me, this can be equally confusing. A user sees the running daemon
with ppid = 1, then this daemon exits and top reports the "unrelated"
process as cpu consumer.


But once again. I am _not_ against this patch. I never know when it
comes to new features. Certainly you know better if this suits top.

What I actually meant is: dear CC list, please look at this change
and comment ;)

Oleg.

Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting
Hello Oleg,

On Thu, 2010-11-18 at 18:10 +0100, Oleg Nesterov wrote:
> > 2. Because of reparent to init, there are situations where it is
> > not clear to which tasks the CPU time of dead tasks between
> > two snapshots has been accounted. This is a problem for example 2.
>
> Yes, I see.
>
> But I must admit, _personally_ I am not sure this problem is worth the
> trouble.

I think you are right...

For that function, the introduced overhead, additional code and
especially the possible confusion with two process hierarchies is not
worth the effort. Maybe we have a chance to solve the reparent problem
by introducing task exit event filters (e.g. give me all exit events for
processes that have init as parent).

So I will send no more patches with the parallel hierarchy:
Good news for you :-)

But for the second problem with the forgotten CPU time, I would like to
send you a new patch set, separated as you have requested. Although I
personally think that there, too, we probably have little chance of getting
them accepted upstream, because the signal_struct size is increased and
some cycles are added to the task exit processing.

It would be nice, if you find some time to look at the patches,
but no hurry!

Michael




