Mailing List Archive: [patch 0/6] lightweight robust futexes: -V3

Re: [patch 0/6] lightweight robust futexes: -V3

Feb 16, 2006, 8:56 PM

Post #26 of 30

> in this particular case i dont think it could be described in a more
> generic way. I'm not against your idea per se - but someone would have
> to code it up ;)

I wasn't talking about 'automation' code, nor about 'more flexible
or generic' code, nor any other changes or additions to your code,
but rather about documentation that spelled out the ABI explicitly,
independent of kernel or glibc code.

Apparently my question was confusing, as you seem to have answered
a different question than I thought I asked. Good answers, to some
other question.

Let me try again, from the beginning.

First, let me point out the tight coupling of this patch set, at least
as currently presented, with glibc. Notice for example the following
comment from your patch:

+ * NOTE: this structure is part of the syscall ABI, and must only be
+ * changed if the change is first communicated with the glibc folks.

Perhaps I am being an old fogey, reflecting times and systems long
past their prime, but I'd have thought that it would be better if
this interface was not so much a contract written in C code between
the kernel and glibc (which is how the above comment sounds) but
rather a contract between the kernel code and a neutral document,
to which any user side implementation, such as glibc, could be written.

The comments and documents for robust_futexes seem to be written
as if Linux had some special arrangement with glibc to be the sole
purveyor of the userland interface. Is that the case?

And half of this contract, the glibc code, isn't even explicitly
presented on this lkml thread.

Second, let me try again to explain what sort of more language neutral
Documentation I was hoping for, this time by example.

As it stands now, I have to read the kernel half of the code to figure
this out (yeah, I know, that complaint won't garner much sympathy on
this list ...)

Let me quit complaining and try to offer up something useful.

How about adding this to Documentation/robust-futexes.txt.

Be warned that the following may be seriously confused.

+++++++++++++++++++++++++++ Begin +++++++++++++++++++++++++++

The robust futex ABI
--------------------

Robust_futexes provide a mechanism that is used in addition to normal
futexes, for kernel assist of cleanup of held locks on task exit.

The interesting data as to what futexes a thread is holding is kept on
a linked list in user space, where it can be updated efficiently as
locks are taken and dropped, without kernel intervention. The only
additional kernel intervention required for robust_futexes above and
beyond what is required for futexes is:
1) a one time call, per thread, to tell the kernel where its list of
held robust_futexes begins, and
2) internal kernel code at exit, to handle any listed locks held
by the exiting thread.

The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a
system call, and handles contested locking by maintaining a list of
waiting threads in the kernel. Options on the sys_futex(2) system
call support waiting on a particular futex, and waking up the next
waiter on a particular futex.

For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them. If it
fails to do so, then improperly listed locks will not be cleaned up
on exit, probably causing deadlock or other such failure of the other
threads waiting on the same locks.

A thread that anticipates possibly using robust_futexes should first
issue the system call:
    asmlinkage long
    sys_set_robust_list(struct robust_list_head __user *head, size_t len);

The pointer 'head' points to a structure in the thread's address space
consisting of three words. Each word is 32 bits on 32 bit arches,
or 64 bits on 64 bit arches, in local byte order. Each thread should
have its own thread private 'head'.

If a thread is running in 32 bit compatibility mode on a 64 bit native
arch kernel, then it can actually have two such structures - one
using 32 bit words for 32 bit compatibility mode, and one using 64 bit
words for 64 bit native mode. The kernel, if it is a 64 bit kernel
supporting 32 bit compatibility mode, will attempt to process both
lists on each task exit, if the corresponding sys_set_robust_list()
call has been made to setup that list.

The first word in the memory structure at 'head' contains a
pointer to a single linked list of 'lock entries', one per lock,
as described below. If the list is empty, the pointer will point
to itself, 'head'. The last 'lock entry' points back to the 'head'.

The second word, called 'offset', specifies the offset, plus or minus,
from the address of each 'lock entry' to what will be called the
'lock word' associated with that 'lock entry'. The 'lock word'
is always a 32 bit word, unlike the other words above. The 'lock
word' holds 3 flag bits in the upper 3 bits, and the thread id (TID)
of the thread holding the lock in the bottom 29 bits. See further
below for a description of the flag bits.

The third word, called 'list_op_pending', contains a transient copy of
the address of the 'lock entry', during list insertion and removal,
and is needed to correctly resolve races should a thread exit while
in the middle of a locking or unlocking operation.
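
As a rough illustration of the three words just described, the layout
could be written in C along the following lines. The struct, field and
function names here are assumptions chosen to match the wording above;
the patch itself remains the authority on the real definition.

```c
#include <assert.h>
#include <stddef.h>

/* One 'lock entry': a single word pointing at the next entry,
 * or back at the head when the list ends. (Illustrative name.) */
struct lock_entry {
    struct lock_entry *next;
};

/* The three-word structure whose address is given to the kernel.
 * (Illustrative name, not taken from the patch.) */
struct list_head_sketch {
    struct lock_entry list;              /* word 1: start of the list   */
    ptrdiff_t offset;                    /* word 2: entry -> lock word  */
    struct lock_entry *list_op_pending;  /* word 3: entry being changed */
};

/* An empty list is one whose first word points back at the head. */
static void head_init(struct list_head_sketch *h, ptrdiff_t offset)
{
    h->list.next = &h->list;
    h->offset = offset;
    h->list_op_pending = NULL;
}

static int head_is_empty(const struct list_head_sketch *h)
{
    return h->list.next == &h->list;
}
```

A thread would call head_init() once on its private head before
registering it with the kernel.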

Each 'lock entry' on the single linked list starting at 'head' consists
of just a single word, pointing to the next 'lock entry', or back to
'head' if there are no more entries. In addition, nearby to each
'lock entry', at an offset from the 'lock entry' specified by the
'offset' word, is one 'lock word'.

The 'lock word' is always 32 bits, and is intended to be the same 32
bit lock variable used by the futex mechanism, in conjunction with
robust_futexes. The kernel will only be able to wake up the next thread
waiting for a lock on a thread's exit if that next thread used the futex
mechanism to register the address of that 'lock word' with the kernel.

For each futex lock currently held by a thread, if it wants this
robust_futex support for exit cleanup of that lock, it should have
one 'lock entry' on this list, with its associated 'lock word' at
the specified 'offset'. Should a thread die while holding any such
locks, the kernel will walk this list, mark any such locks with a bit
indicating their holder died, and wake up the next thread waiting for
that lock using the futex mechanism.

When a thread has invoked the above system call to indicate it
anticipates using robust_futexes, the kernel stores the passed in
'head' pointer for that task. The task may retrieve that value later
on by using the system call:
    asmlinkage long
    sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                        size_t __user *len_ptr);

It is anticipated that threads will use robust_futexes embedded in
larger, user level locking structures, one per lock. The kernel
robust_futex mechanism doesn't care what else is in that structure,
so long as the 'offset' to the 'lock word' is the same for all
robust_futexes used by that thread. The thread should link those
locks it currently holds using the 'lock entry' pointers. It may
also have other links between the locks, such as the reverse side of
a double linked list, but that doesn't matter to the kernel.
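
To make the embedding concrete, here is a small sketch of a user level
lock structure and of how the 'offset' would be derived from it. The
structure 'user_lock', its fields, the macro and the helper are all
hypothetical; only the idea of "entry address plus offset gives the
lock word" comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lock_entry {
    struct lock_entry *next;
};

/* A hypothetical user level lock. The kernel only cares about the
 * 'lock entry' and where the 32 bit 'lock word' sits relative to it. */
struct user_lock {
    const char *name;           /* arbitrary user data                  */
    uint32_t lock_word;         /* the futex word                       */
    struct lock_entry entry;    /* link seen by the kernel              */
    struct user_lock *prev;     /* extra link, invisible to the kernel  */
};

/* The value a thread would store in the 'offset' word of its head.
 * It is negative here because lock_word precedes entry. */
#define LOCK_WORD_OFFSET \
    ((ptrdiff_t)offsetof(struct user_lock, lock_word) - \
     (ptrdiff_t)offsetof(struct user_lock, entry))

/* Locate the 'lock word' that belongs to a given 'lock entry'. */
static uint32_t *lock_word_of(struct lock_entry *e, ptrdiff_t offset)
{
    return (uint32_t *)((char *)e + offset);
}
```

Because the offset is a single per-thread value, every robust lock that
a thread puts on its list has to use the same relative placement.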

By keeping its locks linked this way, on a list starting with a 'head'
pointer known to the kernel, the kernel can provide to a thread the
essential service available for robust_futexes, which is to help
clean up locks held at the time of a (perhaps unexpected) exit.

Actual locking and unlocking, during normal operations, is handled
entirely by user level code in the contending threads, and by the
existing futex mechanism to wait for, and wake up, locks. The kernel's
only essential involvement in robust_futexes is to remember where the
list 'head' is, and to walk the list on thread exit, handling locks
still held by the departing thread, as described below.

There may exist thousands of futex lock structures in a thread's
shared memory, on various data structures, at a given point in time.
Only those lock structures for locks currently held by that thread
should be on that thread's robust_futex linked lock list at a given time.

A given futex lock structure in a user shared memory region may be held
at different times by any of the threads with access to that region.
The thread currently holding such a lock, if any, is marked with the
thread's TID in the lower 29 bits of the 'lock word'.

When adding or removing a lock from its list of held locks, in order
for the kernel to correctly handle lock cleanup regardless of when
the task exits (perhaps it gets an unexpected signal 9 in the middle
of manipulating this list), the user code must observe the following
protocol on 'lock entry' insertion and removal:

On insertion:
1) set the 'list_op_pending' word to the address of the 'lock entry'
to be inserted,
2) acquire the futex lock,
3) add the lock entry, with its thread id (TID) in the bottom 29 bits
of the 'lock word', to the linked list starting at 'head', and
4) clear the 'list_op_pending' word.
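
The insertion steps could be sketched in user code roughly as follows.
This is only an illustration under several assumptions: the types and
the function robust_trylock() are made up, 'list_op_pending' is given
the address of the 'lock entry' (as in the description of the third
word), contention is reported to the caller instead of waiting via
futex, and the compiler and memory barriers a real implementation
would need between the steps are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct lock_entry { struct lock_entry *next; };

struct list_head_sketch {
    struct lock_entry list;
    ptrdiff_t offset;
    struct lock_entry *list_op_pending;
};

/* Hypothetical user level lock with an embedded 'lock entry'. */
struct user_lock {
    _Atomic uint32_t lock_word;
    struct lock_entry entry;
};

/* Try to take 'l' for thread 'tid'. Returns 1 on success, 0 if the
 * lock is already held (a real implementation would futex-wait). */
static int robust_trylock(struct list_head_sketch *h,
                          struct user_lock *l, uint32_t tid)
{
    uint32_t expected = 0;

    /* 1) announce which entry is about to be inserted */
    h->list_op_pending = &l->entry;

    /* 2) acquire the lock: 0 -> TID, atomically */
    if (!atomic_compare_exchange_strong(&l->lock_word, &expected, tid)) {
        h->list_op_pending = NULL;   /* nothing was taken */
        return 0;
    }

    /* 3) link the entry at the front of this thread's list */
    l->entry.next = h->list.next;
    h->list.next = &l->entry;

    /* 4) the operation is complete */
    h->list_op_pending = NULL;
    return 1;
}
```

If the thread dies between steps 2 and 3, the entry is not yet on the
list, which is the window 'list_op_pending' is there to cover.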

XXX I am particularly unsure of the following -pj XXX

On removal:
1) set the 'list_op_pending' word to the address of the 'lock entry'
to be removed,
2) remove the lock entry for this lock from the 'head' list,
3) release the futex lock, and
4) clear the 'list_op_pending' word.
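
A matching sketch of the removal steps, with the same caveats as the
description itself: the order shown simply follows the list above and
is not confirmed. The types and robust_unlock() are hypothetical, the
futex wakeup is left to the caller, and barriers are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LW_WAITERS 0x80000000u   /* bit 31: someone is waiting */

struct lock_entry { struct lock_entry *next; };

struct list_head_sketch {
    struct lock_entry list;
    ptrdiff_t offset;
    struct lock_entry *list_op_pending;
};

struct user_lock {
    _Atomic uint32_t lock_word;
    struct lock_entry entry;
};

/* Drop 'l'. Returns 1 if the waiters bit was set, meaning the caller
 * should follow up with a futex wakeup, else 0. */
static int robust_unlock(struct list_head_sketch *h, struct user_lock *l)
{
    struct lock_entry *p;
    uint32_t old;

    /* 1) announce which entry is about to be removed */
    h->list_op_pending = &l->entry;

    /* 2) unlink the entry from this thread's single linked list */
    for (p = &h->list; p->next != &h->list; p = p->next) {
        if (p->next == &l->entry) {
            p->next = l->entry.next;
            break;
        }
    }

    /* 3) release the lock word */
    old = atomic_exchange(&l->lock_word, 0);

    /* 4) the operation is complete */
    h->list_op_pending = NULL;
    return (old & LW_WAITERS) != 0;
}
```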

On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock entry' found by walking
the list starting at 'head'. For each such address, if the bottom
29 bits of the 'lock word' at offset 'offset' from that address equals
the exiting thread's TID, then the kernel will do two things:
1) if bit 31 (0x80000000) is set in that word, then attempt a futex
wakeup on that address, which will waken the next thread that has
used the futex mechanism to wait on that address, and
2) atomically set bit 30 (0x40000000) in the 'lock word'.

In the above, bit 31 was set by futex waiters on that lock to indicate
they were waiting, and bit 30 is set by the kernel to indicate that
the lock owner died holding the lock.
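
The bit layout of the 'lock word', as described in this text, can be
written down as a handful of masks. The macro and helper names are
made up for this sketch, and the 29 bit TID width is taken from the
description above rather than from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define LW_WAITERS     0x80000000u  /* bit 31: set by waiters           */
#define LW_OWNER_DIED  0x40000000u  /* bit 30: set by kernel on exit    */
#define LW_RESERVED    0x20000000u  /* bit 29: reserved for future use  */
#define LW_TID_MASK    0x1fffffffu  /* bottom 29 bits: owner's TID      */

/* TID of the thread holding the lock, or 0 if nobody holds it. */
static uint32_t lw_tid(uint32_t w)
{
    return w & LW_TID_MASK;
}

static int lw_has_waiters(uint32_t w)
{
    return (w & LW_WAITERS) != 0;
}

static int lw_owner_died(uint32_t w)
{
    return (w & LW_OWNER_DIED) != 0;
}
```

A thread that acquires a lock and finds lw_owner_died() true knows the
previous holder exited without releasing it.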

The kernel exit code will silently stop scanning the list further
if at any point:
1) the 'head' pointer or a subsequent linked list pointer
is not a valid address of a user space word,
2) the calculated location of the 'lock word' (address plus
'offset') is not the valid address of a 32 bit user space
word, or
3) the list contains more than 1 million (subject to
future kernel configuration changes) elements.

When the kernel sees a list entry whose 'lock word' doesn't have the
exiting thread's TID in the lower 29 bits, it does nothing with that
entry, and goes on to the next entry.
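
The exit time walk can be modelled in ordinary user code, which may
help in checking the description. This is a simulation, not kernel
code: all names are hypothetical, the wakeup is only counted, the bit
is set with a plain store where the kernel would use an atomic one,
and the address validity checks are not modelled. Skipping the pending
entry during the walk, so that it is handled once, is an assumption of
this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LW_WAITERS     0x80000000u
#define LW_OWNER_DIED  0x40000000u
#define LW_TID_MASK    0x1fffffffu
#define WALK_LIMIT     1000000      /* the "1 million" from the text */

struct lock_entry { struct lock_entry *next; };

struct list_head_sketch {
    struct lock_entry list;
    ptrdiff_t offset;
    struct lock_entry *list_op_pending;
};

struct user_lock {
    uint32_t lock_word;
    struct lock_entry entry;
};

/* Handle one entry; returns 1 if a futex wakeup would be attempted. */
static int handle_entry(struct lock_entry *e, ptrdiff_t offset, uint32_t tid)
{
    uint32_t *w = (uint32_t *)((char *)e + offset);
    int wake;

    if ((*w & LW_TID_MASK) != tid)
        return 0;                   /* not held by the exiting thread */
    wake = (*w & LW_WAITERS) != 0;
    *w |= LW_OWNER_DIED;            /* mark: owner died holding it */
    return wake;
}

/* Simulate the walk for exiting thread 'tid'; returns wakeup count. */
static int exit_walk(struct list_head_sketch *h, uint32_t tid)
{
    struct lock_entry *e;
    int wakeups = 0, n = 0;

    for (e = h->list.next; e != &h->list; e = e->next) {
        if (++n > WALK_LIMIT)
            break;                  /* silently stop on overlong lists */
        if (e != h->list_op_pending)
            wakeups += handle_entry(e, h->offset, tid);
    }
    if (h->list_op_pending)
        wakeups += handle_entry(h->list_op_pending, h->offset, tid);
    return wakeups;
}
```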

Bit 29 (0x20000000) of the 'lock word' is reserved for future use.

++++++++++++++++++++++++++++ End ++++++++++++++++++++++++++++

Other details ...

Nit ...

+If a futex is found to be held at exit time, the kernel sets the highest
+bit of the futex word:
+
+ #define FUTEX_OWNER_DIED 0x40000000

Contrary to the comment, that doesn't look like the "highest bit."

Confusion ...

+The list is guaranteed to be private and per-thread, so it's lockless.

This statement seems like it is stretching the truth a bit.
As best as I can tell, the 'head' is private per-thread, but the
elements on the list are shared by all contending threads, and so
adding and removing these elements from a given threads list requires
some sort of contention handling mechanism, which the code provides.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [patch 0/6] lightweight robust futexes: -V3

mingo at elte

Feb 17, 2006, 1:41 AM

Post #27 of 30

* Paul Jackson <pj@sgi.com> wrote:

> First, let me point out the tight coupling of this patch set, at least
> as currently presented, with glibc. Notice for example the following
> comment from your patch:
>
> + * NOTE: this structure is part of the syscall ABI, and must only be
> + * changed if the change is first communicated with the glibc folks.

Note that this is really business as usual: we already have dozens of
different 'struct' parameters to hundreds of syscalls, to all of which
exactly these restrictions apply: they must never be changed.

Furthermore there are a good deal of other implicit and explicit data
structure assumptions that all form the ABI - and which the kernel must
not break.

The only unusual thing i guess is that i documented it for this new bit
of functionality ;-)

[ In fact, the robust_list syscalls are shaped so that the structures
_can_ be changed if done with care (due to the length parameter). The
overwhelming majority of our other ABI assumptions are hardcoded and are
only changeable by writing totally new syscalls and phasing out the old
ones. ]
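
One way a length parameter can act as implicit versioning is sketched
below: the receiving side accepts only the structure size it knows.
This is an illustration of the idea only; the function
register_head_sketch() and the types are hypothetical stand-ins, not
the syscall from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct lock_entry { struct lock_entry *next; };

struct list_head_sketch {
    struct lock_entry list;
    ptrdiff_t offset;
    struct lock_entry *list_op_pending;
};

/* Stand-in for the per-task pointer the kernel would remember. */
static struct list_head_sketch *registered_head;

/* Accept the head only if the caller's idea of the structure size
 * matches ours. A later layout could be told apart by a new size. */
static long register_head_sketch(struct list_head_sketch *head, size_t len)
{
    if (len != sizeof(*head))
        return -EINVAL;
    registered_head = head;
    return 0;
}
```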

I agree with your suggestion of better documenting the
kernel<->userspace ABI, but this should be done independently of robust
futexes.

Ingo

Re: [patch 0/6] lightweight robust futexes: -V3

mingo at elte

Feb 17, 2006, 3:59 AM

Post #28 of 30

* Paul Jackson <pj@sgi.com> wrote:

> [ ... nice writeup of the robust-futex ABI ... ]

can i put this into Documentation/robust-futex-ABI.txt?

> Other details ...
>
> Nit ...
>
> +If a futex is found to be held at exit time, the kernel sets the highest
> +bit of the futex word:
> +
> + #define FUTEX_OWNER_DIED 0x40000000
>
> Contrary to the comment, that doesn't look like the "highest bit."

ok, i fixed this in the text.

> Confusion ...
>
> +The list is guaranteed to be private and per-thread, so it's lockless.
>
> This statement seems like it is stretching the truth a bit.
> As best as I can tell, the 'head' is private per-thread, but the
> elements on the list are shared by all contending threads, and so
> adding and removing these elements from a given threads list requires
> some sort of contention handling mechanism, which the code provides.

well, from the kernel's perspective, the list _as it exists_ is private
and per-thread, so it can be accessed in a lockless way.

from the userspace perspective you are right, it's only private if the
list entry is manipulated after acquiring the lock.

i fixed up the text to reflect this.

Ingo

Re: [patch 0/6] lightweight robust futexes: -V3

pj at sgi

Feb 17, 2006, 12:50 PM

Post #29 of 30

> > [ ... nice writeup of the robust-futex ABI ... ]
>
> can i put this into Documentation/robust-futex-ABI.txt?

Good idea - so be it.

Could you review it for accuracy -- I'm sure I screwed
it up in some details, large or small.

Ulrich -- if you're reading this -- your review comments
would be most welcome as well.

In particular:
1) See the description of the removal protocol, below
the XXX comment. I was really guessing there.
2) Could you add a statement on how current code should
handle the FUTEX_OWNER_PENDING bit (when to set it,
when to clear it, when to preserve it) so that current
code won't be incompatible with likely future uses of
this bit?
3) You have implicit ABI versioning in the size of the
head struct. Could you add words describing that?

Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401

Re: [patch 0/6] lightweight robust futexes: -V3 - Why in userspace?

andrew.j.wade at gmail

Feb 17, 2006, 3:47 PM

Post #30 of 30

[Resending: I accidentally did reply one.]

On Thursday 16 February 2006 18:20, Esben Nielsen wrote:
> On Thu, 16 Feb 2006, Ingo Molnar wrote:
>
> >
> > * Esben Nielsen <simlo@phys.au.dk> wrote:
> >
> > > As I understand the protocol the userspace task writes it's pid into
> > > the lock atomically when locking it and erases it atomically when it
> > > leaves the lock. If it is killed inbetween the pid is still there. Now
> > > if another task comes along it reads the pid, sets the wait flag and
> > > goes into the kernel. The kernel will now be able to see that the pid
> > > is no longer valid and therefore the owner must be dead.
> >
> > this is racy - we cannot know whether the PID wrapped around.
> >
> What about adding more bits to check on? The PID to lookup the task_t and
> then some extra bits to uniquely identify the actual task.

The extra identifying bits don't even have to be written at the same
time/place as the PID. They can be written after the futex is acquired,
and cleared before the futex is released. A mechanism similar to
list_op_pending can be used to fill the races: If the extra-id field is
clear, the kernel checks all the list_op_pendings registered for that PID
to see if any of the threads is in the process of acquiring/releasing that
futex. If not, FUTEX_OWNER_DIED. This last process is liable to be quite
nasty and heavy-weight, but it should also be rare if the races are small.

I believe all the races in the last process can be closed if we can count
on FUTEX_WAITERS informing us (the testing process) if the futex is
released. For each thread that might be holding the futex, if
list_op_pending doesn't equal the futex, then that thread can't be acquiring
the futex. If the extra_id is still clear, then that thread can't be holding
the futex. If list_op_pending still doesn't equal the futex, then that
thread can't be freeing the futex. Therefore that thread doesn't have the
futex. So if no threads are holding the futex, and the futex hasn't been
released during this process, then the owner must be dead.

[ This assumes that the freeing thread is the same as the one that acquired
the mutex. This assumption can be relaxed to the freeing thread
being merely in the same process, but beyond that a syscall would be
needed on the freeing side to avoid races.]

I hope this is clear, I'd need to get up to speed on kernel hacking before
I could turn this into code.

Andrew Wade