Mailing List Archive

[RFC][PATCH 0/5] VM_PINNED
Hi all,

I mentioned at LSF/MM that I wanted to revive this, and at the time there were
no disagreements.

I finally got around to refreshing the patch(es) so here goes.

These patches introduce VM_PINNED infrastructure: vma tracking of persistent
'pinned' page ranges. Pinned is anything that has a fixed physical address (as
required for, say, IO DMA engines) and thus cannot use the weaker VM_LOCKED. One
popular way to pin pages is through get_user_pages(), but that is not necessarily
the only way.
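
To make the 'pinned' notion concrete: the typical long-term pin via
get_user_pages() looks roughly like the sketch below. This is purely
illustrative, not code from these patches, and assumes the
get_user_pages() signature of this kernel generation:

static long pin_user_buffer(unsigned long start, unsigned long nr_pages,
                            struct page **pages)
{
        long ret;

        down_read(&current->mm->mmap_sem);
        /* write=1, force=0: fault the pages in and take a reference */
        ret = get_user_pages(current, current->mm, start, nr_pages,
                             1, 0, pages, NULL);
        up_read(&current->mm->mmap_sem);

        /*
         * The page references are only dropped (put_page()) at teardown,
         * which for RDMA buffers can be the lifetime of the process.
         */
        return ret;
}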

Roland, as said, I need some IB assistance, see patches 4 and 5, where I got
lost in the qib and ipath code.

Patches 1-3 compile tested.

Re: [RFC][PATCH 0/5] VM_PINNED
On Mon, May 26, 2014 at 6:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> Hi all,
>
> I mentioned at LSF/MM that I wanted to revive this, and at the time there were
> no disagreements.
>
> I finally got around to refreshing the patch(es) so here goes.
>
> These patches introduce VM_PINNED infrastructure: vma tracking of persistent
> 'pinned' page ranges. Pinned is anything that has a fixed physical address (as
> required for, say, IO DMA engines) and thus cannot use the weaker VM_LOCKED. One
> popular way to pin pages is through get_user_pages(), but that is not necessarily
> the only way.

Lol, this looks like a resurrection of VM_RESERVED, which I removed
not so long ago.

Maybe a single bit of state isn't flexible enough?
Is this supposed to support pinning only by one user, and only in its own mm?

This might be done as an extension of the existing memory-policy engine.
That would keep vm_area_struct slim in the normal case and change
behaviour only when needed.
The memory policy might hold a reference counter of "pinners", track
ownership and so on.
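
For illustration only, the state such an extended policy might carry
could be as simple as this (struct and field names are made up, this is
not existing code):

struct pin_policy {
        atomic_t            nr_pinners; /* reference count of active pinners */
        struct task_struct *owner;      /* which task established the pin */
};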

>
> Roland, as said, I need some IB assistance, see patches 4 and 5, where I got
> lost in the qib and ipath code.
>
> Patches 1-3 compile tested.
>

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 12:19:16AM +0400, Konstantin Khlebnikov wrote:
> On Mon, May 26, 2014 at 6:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Hi all,
> >
> > I mentioned at LSF/MM that I wanted to revive this, and at the time there were
> > no disagreements.
> >
> > I finally got around to refreshing the patch(es) so here goes.
> >
> > These patches introduce VM_PINNED infrastructure: vma tracking of persistent
> > 'pinned' page ranges. Pinned is anything that has a fixed physical address (as
> > required for, say, IO DMA engines) and thus cannot use the weaker VM_LOCKED. One
> > popular way to pin pages is through get_user_pages(), but that is not necessarily
> > the only way.
>
> Lol, this looks like a resurrection of VM_RESERVED, which I removed
> not so long ago.

Not sure what VM_RESERVED did, but there might be a similarity.

> Maybe a single bit of state isn't flexible enough?

Not sure what you mean, the one bit is perfectly fine for what I want it
to do.

> Is this supposed to support pinning only by one user, and only in its own mm?

Pretty much, that's adequate for all users I'm aware of and mirrors the
mlock semantics.

> This might be done as an extension of the existing memory-policy engine.
> That would keep vm_area_struct slim in the normal case and change
> behaviour only when needed.
> The memory policy might hold a reference counter of "pinners", track
> ownership and so on.

That all sounds like abusing the mempolicy code and massive
over-engineering.

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 12:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 27, 2014 at 12:19:16AM +0400, Konstantin Khlebnikov wrote:
>> On Mon, May 26, 2014 at 6:56 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > Hi all,
>> >
>> > I mentioned at LSF/MM that I wanted to revive this, and at the time there were
>> > no disagreements.
>> >
>> > I finally got around to refreshing the patch(es) so here goes.
>> >
>> > These patches introduce VM_PINNED infrastructure: vma tracking of persistent
>> > 'pinned' page ranges. Pinned is anything that has a fixed physical address (as
>> > required for, say, IO DMA engines) and thus cannot use the weaker VM_LOCKED. One
>> > popular way to pin pages is through get_user_pages(), but that is not necessarily
>> > the only way.
>>
>> Lol, this looks like a resurrection of VM_RESERVED, which I removed
>> not so long ago.

That was swap-out prevention from the 2.4 era, something between VM_IO and
VM_LOCKED with various side effects.

>
> Not sure what VM_RESERVED did, but there might be a similarity.
>
>> Maybe a single bit of state isn't flexible enough?
>
> Not sure what you mean, the one bit is perfectly fine for what I want it
> to do.
>
>> Is this supposed to support pinning only by one user, and only in its own mm?
>
> Pretty much, that's adequate for all users I'm aware of and mirrors the
> mlock semantics.

Ok, fine. Because get_user_pages() is sometimes used for pinning pages
from a different mm.

Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
its functionality.
Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
This will make accounting easier. No?

>
>> This might be done as an extension of the existing memory-policy engine.
>> That would keep vm_area_struct slim in the normal case and change
>> behaviour only when needed.
>> The memory policy might hold a reference counter of "pinners", track
>> ownership and so on.
>
> That all sounds like abusing the mempolicy code and massive
> over-engineering.
>

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
> On Tue, May 27, 2014 at 12:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Pretty much, that's adequate for all users I'm aware of and mirrors the
> > mlock semantics.
>
> Ok, fine. Because get_user_pages() is sometimes used for pinning pages
> from a different mm.

Yeah, but that's fairly uncommon, and not something we do for very long,
afaik.

In fact I could only find:

drivers/iommu/amd_iommu_v2.c

fs/exec.c -- temporary use
kernel/events/uprobes.c -- temporary use
mm/ksm.c -- temporary use
mm/process_vm_access.c -- temporary use

With the exception of the iommu one (it wasn't immediately obvious and I
didn't want to stare at the iommu muck too long), they're all temporary;
we drop the page almost immediately again after doing some short work.

The things I care about for VM_PINNED are long term pins, like the IB
stuff, which sets up its RDMA buffers at the start of a program and
basically leaves them in place for the entire duration of said program.

Such pins will disrupt CMA, compaction and pretty much anything that
relies on the page blocks stuff.

> Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
> its functionality.
> Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
> This will make accounting easier. No?

I prefer the PINNED name because not being able to migrate is only one
of its desired effects, not the primary one. We're really looking to
keep physical pages in place and preserve mappings.

The -rt people for example really want to avoid faults (even minor
faults), and DONTMIGRATE would still allow unmapping.

Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
hadn't considered that. The first thing that came to mind is that that
might make the fork() semantics difficult, but maybe it works out.

And while we're on the subject, my patch preserves PINNED over fork()
but maybe we don't actually need that either.

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 12:29:09PM +0200, Peter Zijlstra wrote:
> On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
> > Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
> > its functionality.
> > Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
> > This will make accounting easier. No?
>
> I prefer the PINNED name because not being able to migrate is only one
> of its desired effects, not the primary one. We're really looking to
> keep physical pages in place and preserve mappings.
>
> The -rt people for example really want to avoid faults (even minor
> faults), and DONTMIGRATE would still allow unmapping.
>
> Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
> hadn't considered that. The first thing that came to mind is that that
> might make the fork() semantics difficult, but maybe it works out.
>
> And while we're on the subject, my patch preserves PINNED over fork()
> but maybe we don't actually need that either.

So pinned_vm is userspace exposed, which means we have to maintain the
individual counts, and doing the fully orthogonal accounting is 'easier'
than trying to get the boundary cases right.

That is, if we have a program that does mlockall() and then does the IB
ioctl() to 'pin' a region, we'd have to make mm_mpin() do munlock()
after it splits the vma, and then do the pinned accounting.

Also, we'll have lost the LOCKED state and unless MCL_FUTURE was used,
we don't know what to restore the vma to on mm_munpin().

So while the accounting looks tricky, it has simpler semantics.
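
To make the 'fully orthogonal' part concrete, the pin side then does no
more than this (a sketch, not the patch code; nr_pages is simply the size
of the pinned range):

        /* in mm_mpin(), after splitting the vmas and setting VM_PINNED: */
        mm->pinned_vm += nr_pages;      /* its own counter */
        /* mm->locked_vm is left alone; mlock()/munlock() keep managing it */

That way pinning an already mlocked region never has to undo, or later
reconstruct, the VM_LOCKED state.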

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 2:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 27, 2014 at 12:29:09PM +0200, Peter Zijlstra wrote:
>> On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
>> > Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
>> > its functionality.
>> > Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
>> > This will make accounting easier. No?
>>
>> I prefer the PINNED name because not being able to migrate is only one
>> of its desired effects, not the primary one. We're really looking to
>> keep physical pages in place and preserve mappings.

Ah, I just mixed it up.

>>
>> The -rt people for example really want to avoid faults (even minor
>> faults), and DONTMIGRATE would still allow unmapping.
>>
>> Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
>> hadn't considered that. The first thing that came to mind is that that
>> might make the fork() semantics difficult, but maybe it works out.
>>
>> And while we're on the subject, my patch preserves PINNED over fork()
>> but maybe we don't actually need that either.
>
> So pinned_vm is userspace exposed, which means we have to maintain the
> individual counts, and doing the fully orthogonal accounting is 'easier'
> than trying to get the boundary cases right.
>
> That is, if we have a program that does mlockall() and then does the IB
> ioctl() to 'pin' a region, we'd have to make mm_mpin() do munlock()
> after it splits the vma, and then do the pinned accounting.
>
> Also, we'll have lost the LOCKED state and unless MCL_FUTURE was used,
> we don't know what to restore the vma to on mm_munpin().
>
> So while the accounting looks tricky, it has simpler semantics.

What if VM_PINNED required VM_LOCKED?
I.e. the user must mlock it before pinning and cannot munlock the vma while it's pinned.

Re: [RFC][PATCH 0/5] VM_PINNED
On 05/27/2014 01:11 PM, Konstantin Khlebnikov wrote:
> On Tue, May 27, 2014 at 2:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, May 27, 2014 at 12:29:09PM +0200, Peter Zijlstra wrote:
>>> On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
>>>> Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
>>>> its functionality.
>>>> Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
>>>> This will make accounting easier. No?
>>>
>>> I prefer the PINNED name because not being able to migrate is only one
>>> of its desired effects, not the primary one. We're really looking to
>>> keep physical pages in place and preserve mappings.
>
> Ah, I just mixed it up.
>
>>>
>>> The -rt people for example really want to avoid faults (even minor
>>> faults), and DONTMIGRATE would still allow unmapping.
>>>
>>> Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
>>> hadn't considered that. The first thing that came to mind is that that
>>> might make the fork() semantics difficult, but maybe it works out.
>>>
>>> And while we're on the subject, my patch preserves PINNED over fork()
>>> but maybe we don't actually need that either.
>>
>> So pinned_vm is userspace exposed, which means we have to maintain the
>> individual counts, and doing the fully orthogonal accounting is 'easier'
>> than trying to get the boundary cases right.
>>
>> That is, if we have a program that does mlockall() and then does the IB
>> ioctl() to 'pin' a region, we'd have to make mm_mpin() do munlock()
>> after it splits the vma, and then do the pinned accounting.
>>
>> Also, we'll have lost the LOCKED state and unless MCL_FUTURE was used,
>> we don't know what to restore the vma to on mm_munpin().
>>
>> So while the accounting looks tricky, it has simpler semantics.
>
> What if VM_PINNED required VM_LOCKED?
> I.e. the user must mlock it before pinning and cannot munlock the vma while it's pinned.

Mlocking makes sense, as pages won't be uselessly scanned on the
non-evictable LRU, no? (Or maybe I just don't see that something else
prevents them from being there already).

Anyway I like the idea of playing nicer with compaction etc.

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 03:11:36PM +0400, Konstantin Khlebnikov wrote:
> On Tue, May 27, 2014 at 2:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, May 27, 2014 at 12:29:09PM +0200, Peter Zijlstra wrote:
> >> On Tue, May 27, 2014 at 12:49:08AM +0400, Konstantin Khlebnikov wrote:
> >> > Another suggestion. VM_RESERVED is stronger than VM_LOCKED and extends
> >> > its functionality.
> >> > Maybe it's easier to add VM_DONTMIGRATE and use it together with VM_LOCKED.
> >> > This will make accounting easier. No?
> >>
> >> I prefer the PINNED name because not being able to migrate is only one
> >> of its desired effects, not the primary one. We're really looking to
> >> keep physical pages in place and preserve mappings.
>
> Ah, I just mixed it up.
>
> >>
> >> The -rt people for example really want to avoid faults (even minor
> >> faults), and DONTMIGRATE would still allow unmapping.
> >>
> >> Maybe always setting VM_PINNED and VM_LOCKED together is easier, I
> >> hadn't considered that. The first thing that came to mind is that that
> >> might make the fork() semantics difficult, but maybe it works out.
> >>
> >> And while we're on the subject, my patch preserves PINNED over fork()
> >> but maybe we don't actually need that either.
> >
> > So pinned_vm is userspace exposed, which means we have to maintain the
> > individual counts, and doing the fully orthogonal accounting is 'easier'
> > than trying to get the boundary cases right.
> >
> > That is, if we have a program that does mlockall() and then does the IB
> > ioctl() to 'pin' a region, we'd have to make mm_mpin() do munlock()
> > after it splits the vma, and then do the pinned accounting.
> >
> > Also, we'll have lost the LOCKED state and unless MCL_FUTURE was used,
> > we don't know what to restore the vma to on mm_munpin().
> >
> > So while the accounting looks tricky, it has simpler semantics.
>
> What if VM_PINNED required VM_LOCKED?
> I.e. the user must mlock it before pinning and cannot munlock the vma while it's pinned.

So I don't like restrictions like that if it's at all possible to avoid
-- and in this case, I already wrote the code and it's not _that_
complicated.

But also, that would mean we'd either have to make mm_mpin() do the
mlock unconditionally (which rather defeats the purpose) or break
userspace assumptions. I'm fairly sure the IB ioctl() doesn't require the
memory to be mlocked.

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 01:50:47PM +0200, Vlastimil Babka wrote:
> > What if VM_PINNED required VM_LOCKED?
> > I.e. the user must mlock it before pinning and cannot munlock the vma while it's pinned.
>
> Mlocking makes sense, as pages won't be uselessly scanned on the
> non-evictable LRU, no? (Or maybe I just don't see that something else
> prevents them from being there already).

We can add VM_PINNED logic to page_check_references() and
try_to_unmap_one() to avoid the scanning if that's a problem. But that's
additional bits.
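
Something like the following, as a sketch only (assuming the rmap
interfaces of this era; this is not part of the posted patches):

        /* early in try_to_unmap_one(), and similarly in page_check_references() */
        if (vma->vm_flags & VM_PINNED)
                return SWAP_FAIL;       /* keep the mapping, reclaim moves on */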

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, 27 May 2014, Peter Zijlstra wrote:

> The things I care about for VM_PINNED are long term pins, like the IB
> stuff, which sets up its RDMA buffers at the start of a program and
> basically leaves them in place for the entire duration of said program.

Ok that also means the pages are not to be allocated from ZONE_MOVABLE?

I expected the use of a page flag. With a vma flag we may have a situation
where mapping a page into a vma changes it to pinned and terminating a
process may unpin a page. That means the zone that the page should be
allocated from changes.

Pinned pages in ZONE_MOVABLE are not a good idea. But since "kernelcore"
is rarely used maybe that is not an issue?




Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 09:34:22AM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
>
> > The things I care about for VM_PINNED are long term pins, like the IB
> > stuff, which sets up its RDMA buffers at the start of a program and
> > basically leaves them in place for the entire duration of said program.
>
> Ok that also means the pages are not to be allocated from ZONE_MOVABLE?

Well, like with IB, they start out as normal userspace pages, and will
be from ZONE_MOVABLE.

> I expected the use of a page flag. With a vma flag we may have a situation
> where mapping a page into a vma changes it to pinned and terminating a
> process may unpin a page. That means the zone that the page should be
> allocated from changes.

So the only way to 'map' something into pinned is what perf does (have
the f_ops->mmap call set VM_PINNED). But that way already ensures we
have full control over the allocation since it's a custom file.

And in fact the perf buffer is allocated with GFP_KERNEL and is thus
already not from MOVABLE.
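
A minimal sketch of such an ->mmap hook (illustrative only; apart from
VM_PINNED itself, the names and flags here are assumptions, not perf's
actual code):

static int foo_mmap(struct file *file, struct vm_area_struct *vma)
{
        /* driver-owned buffer, allocated with GFP_KERNEL elsewhere */
        vma->vm_flags |= VM_PINNED | VM_DONTEXPAND;
        /* ... install the buffer pages, e.g. via remap_pfn_range() ... */
        return 0;
}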

Any other use, like (again) the IB stuff, will go through
get_user_pages() which will ensure all the pages are mapped and present.

So I don't think this is a real problem and certainly not one that
requires a page flag.

> Pinned pages in ZONE_MOVABLE are not a good idea. But since "kernelcore"
> is rarely used maybe that is not an issue?

Well, the idea was to migrate pages to a more suitable location on
mm_mpin(). We could choose to move them out again on mm_munpin() or not.

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, 27 May 2014, Peter Zijlstra wrote:

> Well, like with IB, they start out as normal userspace pages, and will
> be from ZONE_MOVABLE.

Well, we could change that now, I think. If the VMA has VM_PINNED set,
then do not allocate pages from ZONE_MOVABLE.

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 10:14:10AM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
>
> > Well, like with IB, they start out as normal userspace pages, and will
> > be from ZONE_MOVABLE.
>
> Well, we could change that now, I think. If the VMA has VM_PINNED set,
> then do not allocate pages from ZONE_MOVABLE.

But most allocation sites don't have the vma. We allocate page-cache
pages based on their address_space/mapping, not on whatever vma they're
mapped into.

So I still think the sanest way to do this is by making mm_mpin() do a
mm_populate() and have reclaim skip VM_PINNED pages (so they stay
present), and then migrate the lot out of MOVABLE.
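
In outline (a sketch of that flow, not the patch code; only the
mm_mpin() entry point itself comes from this RFC):

static int mm_mpin(unsigned long start, unsigned long len)
{
        /* 1) mark the range VM_PINNED (vma splitting/flagging not shown) */
        /* 2) make sure every page in the range is mapped and present */
        mm_populate(start, len);
        /* 3) migrate anything still sitting in ZONE_MOVABLE or a CMA
         *    pageblock to an unmovable location (not shown) */
        return 0;
}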


Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, 27 May 2014, Peter Zijlstra wrote:

> On Tue, May 27, 2014 at 10:14:10AM -0500, Christoph Lameter wrote:
> > On Tue, 27 May 2014, Peter Zijlstra wrote:
> >
> > > Well, like with IB, they start out as normal userspace pages, and will
> > > be from ZONE_MOVABLE.
> >
> > Well, we could change that now, I think. If the VMA has VM_PINNED set,
> > then do not allocate pages from ZONE_MOVABLE.
>
> But most allocation sites don't have the vma. We allocate page-cache
> pages based on their address_space/mapping, not on whatever vma they're
> mapped into.

Most allocations by the application for an address range also must
consider a memory allocation policy which is also bound to a vma and we
have code for that in mm/mempolicy.c

Code could be easily added to alloc_pages_vma() to consider the pinned
status on allocation. Remove GFP_MOVABLE if the vma is pinned.
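
Concretely, something like this inside alloc_pages_vma() (a sketch; the
flag is spelled __GFP_MOVABLE in the allocator):

        if (vma->vm_flags & VM_PINNED)
                gfp &= ~__GFP_MOVABLE;  /* keep pinned pages out of ZONE_MOVABLE */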

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 11:31:21AM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
>
> > On Tue, May 27, 2014 at 10:14:10AM -0500, Christoph Lameter wrote:
> > > On Tue, 27 May 2014, Peter Zijlstra wrote:
> > >
> > > > Well, like with IB, they start out as normal userspace pages, and will
> > > > be from ZONE_MOVABLE.
> > >
> > > Well, we could change that now, I think. If the VMA has VM_PINNED set,
> > > then do not allocate pages from ZONE_MOVABLE.
> >
> > But most allocation sites don't have the vma. We allocate page-cache
> > pages based on their address_space/mapping, not on whatever vma they're
> > mapped into.
>
> Most allocations by the application for an address range also must
> consider a memory allocation policy which is also bound to a vma and we
> have code for that in mm/mempolicy.c
>
> Code could be easily added to alloc_pages_vma() to consider the pinned
> status on allocation. Remove GFP_MOVABLE if the vma is pinned.

Yes, but alloc_pages_vma() isn't used for shared pages (with exception
of shmem and hugetlbfs).

So whichever way around we have to do the mm_populate() + eviction hook
+ migration code, and since that equally covers the anon case, why
bother?


Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, 27 May 2014, Peter Zijlstra wrote:

> > Code could be easily added to alloc_pages_vma() to consider the pinned
> > status on allocation. Remove GFP_MOVABLE if the vma is pinned.
>
> Yes, but alloc_pages_vma() isn't used for shared pages (with exception
> of shmem and hugetlbfs).

alloc_pages_vma() is used for all paths where we populate address ranges
with pages. This is what we are doing when pinning. Pages are not
allocated outside of a vma context.

What do you mean by shared pages that are not shmem pages? AnonPages that
are referenced from multiple processes?

> So whichever way around we have to do the mm_populate() + eviction hook
> + migration code, and since that equally covers the anon case, why
> bother?

Migration is expensive and the memory registration overhead already
causes lots of complaints.

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 11:56:44AM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
>
> > > Code could be easily added to alloc_pages_vma() to consider the pinned
> > > status on allocation. Remove GFP_MOVABLE if the vma is pinned.
> >
> > Yes, but alloc_pages_vma() isn't used for shared pages (with exception
> > of shmem and hugetlbfs).
>
> alloc_pages_vma() is used for all paths where we populate address ranges
> with pages. This is what we are doing when pinning. Pages are not
> allocated outside of a vma context.
>
> What do you mean by shared pages that are not shmem pages? AnonPages that
> are referenced from multiple processes?

Regular files.. they get allocated through __page_cache_alloc(). AFAIK
there is nothing stopping people from pinning file pages for RDMA or
other purposes. Unusual maybe, but certainly not impossible, and
therefore we must be able to handle it.

> > So whichever way around we have to do the mm_populate() + eviction hook
> > + migration code, and since that equally covers the anon case, why
> > bother?
>
> Migration is expensive and the memory registration overhead already
> causes lots of complaints.

Sure, but first do the simple thing, then if it's a problem do something
else.

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, 27 May 2014, Peter Zijlstra wrote:

> > What do you mean by shared pages that are not shmem pages? AnonPages that
> > are referenced from multiple processes?
>
> Regular files.. they get allocated through __page_cache_alloc(). AFAIK
> there is nothing stopping people from pinning file pages for RDMA or
> other purposes. Unusual maybe, but certainly not impossible, and
> therefore we must be able to handle it.

Typically structures for RDMA are allocated on the heap.

The main use case is pinning the executable pages in the page cache?

Pinning mmapped pagecache pages may not have the desired effect
if those pages are modified and need updates on disk with corresponding
faults to track the dirty state etc. This may get more complicated.

> > Migration is expensive and the memory registration overhead already
> > causes lots of complaints.
>
> Sure, but first do the simple thing, then if it's a problem do something
> else.

I thought the main issue here was the pinning of IB/RDMA buffers.

Re: [RFC][PATCH 0/5] VM_PINNED
On Tue, May 27, 2014 at 03:00:15PM -0500, Christoph Lameter wrote:
> On Tue, 27 May 2014, Peter Zijlstra wrote:
>
> > > What do you mean by shared pages that are not shmem pages? AnonPages that
> > > are referenced from multiple processes?
> >
> > Regular files.. they get allocated through __page_cache_alloc(). AFAIK
> > there is nothing stopping people from pinning file pages for RDMA or
> > other purposes. Unusual maybe, but certainly not impossible, and
> > therefore we must be able to handle it.
>
> Typically structures for RDMA are allocated on the heap.

Sure, typically. But that's not enough.

> The main use case is pinning the executable pages in the page cache?

No.. although that's one of the things the -rt people are interested in.

> > > Migration is expensive and the memory registration overhead already
> > > causes lots of complaints.
> >
> > Sure, but first do the simple thing, then if it's a problem do something
> > else.
>
> I thought the main issue here was the pinning of IB/RDMA buffers.

It is... but you have to deal with the generic case before you go off
doing specific things.

You're approaching the problem the wrong way: first make it such
that everything works; only then optimize some specific case, if and
when it becomes important.

Don't start by only looking at the one specific case you're interested
in and forgetting about everything else.

Re: [RFC][PATCH 0/5] VM_PINNED
On Mon, 2014-05-26 at 22:32 +0200, Peter Zijlstra wrote:

> Not sure what you mean, the one bit is perfectly fine for what I want it
> to do.
>
> Is this supposed to support pinning only by one user, and only in its own mm?
>
> Pretty much, that's adequate for all users I'm aware of and mirrors the
> mlock semantics.

Ok, so I only just saw this. CC'ing Alex Williamson.

There is definitely another potential user for that stuff, which is KVM
with passed-through devices.

What vfio does today on x86 is "interesting":

Look at drivers/vfio/vfio_iommu_type1.c and the vfio_pin_pages() function.

I especially like the racy "delayed" accounting ...

The problem is that in the generic case of VFIO, we don't know in
advance what needs to be pinned. The user might pin pages on demand and
it has to be a reasonably fast path.

Additionally, a given page can be mapped multiple times and we don't
have a good place to keep a counter....

So the one bit of state is definitely not enough.

Cheers,
Ben.

