Mailing List Archive

Re: [RFC 0/8] Define coherent device memory node
On Fri, Oct 28, 2016 at 11:17:31AM +0530, Anshuman Khandual wrote:
> On 10/27/2016 08:35 PM, Jerome Glisse wrote:
> > On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote:
> >> On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
> >>> On 10/26/2016 09:32 PM, Jerome Glisse wrote:
> >>>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
> >>>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> >>>>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> >>>>>>> Jerome Glisse <j.glisse@gmail.com> writes:
> >>>>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> >>>>>>>>> Jerome Glisse <j.glisse@gmail.com> writes:
> >>>>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >
> > [...]
> >
> >>>> In my patchset there is no policy; it is all under device driver control, which
> >>>> decides what range of memory is migrated and when. I think only the device driver has
> >>>> the proper knowledge to make such a decision, by coalescing data from GPU counters and
> >>>> requests from the application made through an upper-level programming API like
> >>>> CUDA.
> >>>>
> >>>
> >>> Right, I understand that. But what I pointed out here is that there are problems
> >>> now migrating user mapped pages back and forth between LRU system RAM memory and
> >>> non LRU device memory which is yet to be solved. Because you are proposing a non
> >>> LRU based design with ZONE_DEVICE, how we are solving/working around these
> >>> problems for bi-directional migration ?
> >>
> >> Let me elaborate on this bit more. Before non LRU migration support patch series
> >> from Minchan, it was not possible to migrate non LRU pages which are generally
> >> driver managed through migrate_pages interface. This was affecting the ability
> >> to do compaction on platforms which has a large share of non LRU pages. That series
> >> actually solved the migration problem and allowed compaction. But it still did not
> >> solve the migration problem for non LRU *user mapped* pages. So if the non LRU pages
> >> are mapped into a process's page table and being accessed from user space, it can
> >> not be moved using migrate_pages interface.
> >>
> >> Minchan had a draft solution for that problem which is still hosted here. On his
> >> suggestion I had tried this solution but still faced some other problems during
> >> mapped pages migration. (NOTE: IIRC this was not posted in the community)
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following
> >> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53)
> >>
> >> As I had mentioned earlier, we intend to support all possible migrations between
> >> system RAM (LRU) and device memory (Non LRU) for user space mapped pages.
> >>
> >> (1) System RAM (Anon mapping) --> Device memory, back and forth many times
> >> (2) System RAM (File mapping) --> Device memory, back and forth many times
> >
> > I achieve these 2 objectives in HMM; i sent you the additional patches for file
> > backed page migration. I am not done working on them but they are small.
>
> Sure, will go through them. Thanks !
>
> >
> >
> >> This is not happening now with non LRU pages. Here are some of reasons but before
> >> that some notes.
> >>
> >> * Driver initiates all the migrations
> >> * Driver does the isolation of pages
> >> * Driver puts the isolated pages in a linked list
> >> * Driver passes the linked list to migrate_pages interface for migration
> >> * IIRC isolation of non LRU pages happens through page->as->aops->isolate_page call
> >> * If migration fails, call page->as->aops->putback_page to give the page back to the
> >> device driver
> >>
> >> 1. queue_pages_range() currently does not work with non LRU pages, needs to be fixed
> >>
> >> 2. After a successful migration from non LRU device memory to LRU system RAM, the non
> >> LRU will be freed back. Right now migrate_pages releases these pages to buddy, but
> >> in this situation we need the pages to be given back to the driver instead. Hence
> >> migrate_pages needs to be changed to accommodate this.
> >>
> >> 3. After LRU system RAM to non LRU device migration for a mapped page, does the new
> >> page (which came from device memory) will be part of core MM LRU either for Anon
> >> or File mapping ?
> >>
> >> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page,
> >> how we are going to store "address_space->address_space_operations" and "Anon VMA
> >> Chain" reverse mapping information both on the page->mapping element ?
> >>
> >> 5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page,
> >> how we are going to store "address_space->address_space_operations" of the device
> >> driver and radix tree based reverse mapping information for the existing file
> >> mapping both on the same page->mapping element ?
> >>
> >> 6. IIRC, it was not possible to retain the non LRU identity (page->as->aops which will
> >> be defined inside the device driver) and the reverse mapping information (either anon
> >> or file mapping) together after first round of migration. This non LRU identity needs
> >> to be retained continuously if we ever need to return this page to device driver after
> >> successful migration to system RAM or for isolation/putback purpose or something else.
> >>
> >> All the reasons explained above were preventing a continuous ping-pong scheme of migration
> >> between system RAM LRU buddy pages and device memory non LRU pages which is one of the
> >> primary requirements for exploiting coherent device memory. Do you think we can solve these
> >> problems with ZONE_DEVICE and HMM framework ?
> >
> > Well HMM already achieve migration but design is slightly different :
> > * Device driver initiate migration by calling hmm_migrate(mm, start, end, pfn_array);
> > It must provide a pfn_array that is big enough to have one entry per page for the
> > range (so ((end - start) >> PAGE_SHIFT) entries). With this array no list of page.
>
> If we are not going to use standard core migrate_pages() interface, there is no need
> of building a linked list of isolated source pages for migration. Though I see a
> different hmm_migrate() function in the V13 tree which involves hmm_migrate structure,
> lets focus on hmm_migrate(mm, start, end, pfn_array) format. I guess (mm, start, end)
> describes the virtual range of a process which needs to be migrated and pfn_array[]
> is the destination array of PFNs for the migration ?

The hmm_migrate struct is just a placeholder for all the arguments (vma, start, end,
pfn_arrays ptr, flags, ...). I can hide it inside the migrate function; it is easier
to pass around to sub-functions than always having a long list of args.

>
> * I assume pfn_array[] can contain either system RAM PFN or device memory PFN ? It
> will support migration in both directions ?

Correct, both directions are supported.

>
> * Device memory PFN can have struct pages (If ZONE_DEVICE based) or it may not have
> struct pages ?

The memory must have struct pages; this is needed so that the anon_vma and the mapping
for file-backed pages are properly tracked.

> >
> > * hmm_migrate() collects source pages from the process. Right now it will only migrate
> > things that have been faulted, ie with a valid CPU page table entry, and will ignore
> > swap entries or any other special CPU page table entries. Those source pages are stored
> > in the pfn array (using their pfn value with flags like write permission)
>
> So source PFNs go into pfn_array[], I was thinking it contains destination PFNs.

In the first pass it contains the source pfns so the driver doesn't have to walk the CPU
page table; drivers that use the CPU page table directly can ignore it. It is only after
the device driver callback that the array is populated with destination memory.
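
The two-pass use of the pfn array described above can be sketched in plain
userspace C. This is only an illustration under stated assumptions, not the
HMM code: every name here (sketch_entry_t, the SKETCH_* flags,
sketch_alloc_and_copy()) is hypothetical, and so is the flag encoding. It shows
an array sized with one entry per page, holding source pfns first, and a
stand-in for the driver callback overwriting the still-valid entries with
destination pfns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: the low bits carry flags, the rest is the pfn. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_FLAG_BITS    2
#define SKETCH_FLAG_MIGRATE (UINT64_C(1) << 0) /* entry is still a candidate */
#define SKETCH_FLAG_WRITE   (UINT64_C(1) << 1) /* source mapping was writable */
#define SKETCH_FLAG_MASK    (SKETCH_FLAG_MIGRATE | SKETCH_FLAG_WRITE)

typedef uint64_t sketch_entry_t;

static sketch_entry_t sketch_entry(uint64_t pfn, uint64_t flags)
{
    return (pfn << SKETCH_FLAG_BITS) | (flags & SKETCH_FLAG_MASK);
}

static uint64_t sketch_entry_pfn(sketch_entry_t e)
{
    return e >> SKETCH_FLAG_BITS;
}

/* One array entry per page in the virtual range [start, end). */
static size_t sketch_npages(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return (size_t)((end - start) >> SKETCH_PAGE_SHIFT);
}

/*
 * Stand-in for the driver callback: every entry still flagged as a
 * candidate is replaced by a destination pfn, keeping its flags.
 * Cleared entries (e.g. a pinned page) are skipped.
 * Returns the number of entries rewritten.
 */
static size_t sketch_alloc_and_copy(sketch_entry_t *entries, size_t n,
                                    uint64_t first_dst_pfn)
{
    size_t done = 0;

    for (size_t i = 0; i < n; i++) {
        if (!(entries[i] & SKETCH_FLAG_MIGRATE))
            continue;
        entries[i] = sketch_entry(first_dst_pfn + done,
                                  entries[i] & SKETCH_FLAG_MASK);
        done++;
    }
    return done;
}
```

A caller would fill the array with source entries, clear the ones that cannot
be migrated, and then hand the same array to the callback, so no separate list
of pages is needed.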


> >
> > * hmm_migrate() isolate all lru pages collected in previous step. For ZONE_DEVICE pages
> > it does nothing. Non lru page can be migrated only if it is a ZONE_DEVICE page. Any
> > non lru page that is not ZONE_DEVICE is ignored.
>
> Hmm, may be because it does not have either page->pgmap (which you have extended to
> contain some driver specific callbacks) or page->as->aops (Minchan Kim's framework).
> Therefore any other kind of non LRU pages cannot migrate.
>
> >
> > * hmm_migrate() unmaps all the pages and checks the refcount. If a page is pinned then
> > it restores the CPU page table, puts the page back on the lru (if it is not a ZONE_DEVICE page)
> > and clears the associated entry inside the pfn_array.
>
> Got it. pfn_array[] at the end will contain all PFNs which need to be migrated.

Yup

>
> >
> > * hmm_migrate() use device driver callback alloc_and_copy() this device driver callback
> > will allocate destination device page and copy from the source page. It uses the pfn
>
> So if the migration is from device to system RAM, alloc_and_copy() will allocate the
> destination system RAM pages and at that time pfn_array[] contains source device memory
> PFNs ? I am just trying see if it works both ways.

Yes, inside hmm_devmem* there is actually a helper that does just that, so device drivers don't
have to worry about the device to system RAM direction. But a device driver can choose not to
use hmm_devmem* and handle things on its own (i would rather have device drivers use common
helpers to avoid each device driver making different mistakes).


> > array to know which page can be migrated in the range (there is a flag). The callback
> > must also update the pfn_array and replace any entry that was successfully allocated
> > and copied with the pfn of the device page (and flag).
> >
> > * hmm_migrate() do the final struct page meta-data migration which might fail in case of
> > file back page (buffer head migration fails or radix tree fails ...)
> >
> > * hmm_migrate() updates the CPU page table, ie replaces the special migration entry to point
> > to the new page if migration was successful, or restores the old page otherwise. It also unlocks
> > the pages and calls put_page() on them, either through lru put back or directly for
> > ZONE_DEVICE pages.
>
> If it's a ZONE_DEVICE page, the registered device driver also gets notified about it ?
> So that it can update it's own accounting regarding the allocated and free memory pages
> that it owns through a hot plugged ZONE_DEVICE zone ?
>
> >
> > * hmm_migrate() call cleanup() only now device driver can update its page table
>
> Though I still need to understand the page table mirroring part, I can clearly see
> that hmm_migrate() attempts to implement a parallel migrate_pages() kind of interface
> which can work with non LRU pages (right now ZONE_DEVICE based only) and a device
> driver. We will have to see whether this hmm_migrate() interface can accommodate all
> kind and direction of migration.
>
> Minchan Kim's framework enabled non LRU page migration in a different way. The device
> driver is supposed to create a stand alone struct address_space_operations and struct
> address_space and load them into each struct page with a call. Now all non LRU pages
> contains the stand alone struct address_space_operations as page->as->aops based
> callbacks.
>
> Now we have a different way of enabling non LRU device page migration by extending
> ZONE_DEVICE framework, does it overlap with the functionality already supported
> by the previous framework ? I am just curious.

I think Minchan is trying to allow migration for device driver kernel allocated memory,
ie not memory that ends up inside a regular vma (non special vma) but only inside a device
driver file vma, if at all. So we are targeting different problems.

Me, i only care about "regular" process memory, ie private anonymous or shared memory
(either backed by a regular file or pure shared memory). I do not want to mess with any of
the device driver vmas or any special vma that is under the control of an unknown device
driver. Trying to migrate any such special memory is just not going to work. Moreover,
i believe it is not something we care about in the first place. The GPU will work on either the
regular process memory or some GPU specific memory but won't try to mess with other
device vmas.

>
> >
> >
> > I am slightly changing the last 2 steps: it would call the device driver callback first
> > and then restore the CPU page table, and the device driver callback would be renamed to
> > finalize_and_map().
> >
> > So with this design:
> > 1. is a non-issue (use of pfn array and not list of page).
>
> Right.
>
> >
> > 2. is a non-issue: successful migration from ZONE_DEVICE (GPU memory) to system
> > memory calls put_page(), which in turn will call inside the device driver
> > to inform the device driver that the page is free (assuming the refcount on the page
> > reaches 1)
>
> Right.
>
> >
> > 3. New page is not part of the LRU if it is a device page. Assumption is that the
> > device driver wants to manage its memory by itself and LRU would interfere with
> > that. Moreover this is a device page and thus it is not something that should be
> > used for emergency memory allocation or any regular allocation. So it is pointless
> > for the kernel to try to keep aging those pages to see when they can be reclaimed.
>
> If the driver manages everything, these device memory pages need not be on the LRU after
> migration. But not being on any LRU makes it difficult for other core MM features to work
> on these pages any more. Almost all core mm interfaces expect the pages to be on LRU, IIUC.
> Though they all can be changed to accommodate non LRU pages but dont you think that can be
> a lot of work ? Just curious.

There is no code that assumes that a page is on the lru. There is code that assumes a new page must
go on the lru (the file system page read for instance). But all code paths i went through (i tried
to go over all of them but i might have missed some) will gracefully handle a page that is not
on the lru. In all cases i have been through this just meant ignoring the page. Which is what
i wanted in the first place :) for device memory to be left alone.

My hope is that at some point hardware will have enough commonality that implementing a
generic per device lru might make sense. Same for other kernel mm functionality.

>
> >
> > 4. I do not store address_space operation of a device, i extended struct dev_pagemap
> > to have more callback and this can be access through struct page->pgmap
> > So the for ZONE_DEVICE page the page->mapping point to the expected page->mapping
> > ie for anonymous page it points to the anon vma chain and for file back page it
> > points to the address space of the filesystem on which the file is.
>
> Right.
>
> >
> > 5. See 4 above
>
> Right.
>
> >
> > 6. I do not store any device driver specific address space operation inside struct
> > page. I do not see the need for that and doing so would require major changes to
> > kernel mm code. All the device driver cares about is being told when a page is
> > free (as i am assuming device does the allocation in the first place).
> >
>
> Minchan's work introduced the idea of PageMovable (IIUC, it just says its a movable
> non LRU page with page->mapping->aops and some struct page flags) and changed parts
> of the core MM migration and compaction functions to accommodate MovablePage.

Like i said above, i think he is targeting device driver allocated pages that are not
part of a regular vma (private anonymous or shared file) but are used by the device driver.

>
> > It seems you want to rely on following struct address_space_operations callback:
> > void (*putback_page)(struct page *);
> > bool (*isolate_page)(struct page *, isolate_mode_t);
> > int (*migratepage) (...);
> >
> > For putback_page i added a free_page() to struct dev_pagemap which does the job.
>
> Right, sounds correct from this ZONE_DEVICE based framework.
>
> > I do not see need for isolate_page() and it would be bad as some filesystem do
> > special thing in that callback. If you update the CPU page table the device should
>
> It was a dummy device driver specific address_space_operations, hence its not related
> to any file system as such.
>
> > see that and i do not think you would need any special handling inside the device
> > driver code.
>
> I need to understand this part. How a call back from CPU page table update comes to
> the device driver, will go through HMM V13 for that.

It goes through the update() callback of hmm_mirror_ops, which is part of the hmm_mirror
struct. So the device driver registers an hmm_mirror against an mm, which binds the device
to the given mm. Any update to the CPU page table calls mmu_notifier and hmm forwards those
calls to the device driver through hmm_mirror_ops.update().
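
The forwarding pattern described above can be sketched in userspace C. This is
a rough illustration under stated assumptions, not the HMM interface: all types
and functions here (sketch_mm, sketch_mirror, sketch_mirror_ops,
sketch_mirror_register(), sketch_notify_range()) are hypothetical, and the real
update() callback signature may differ. It only shows an ops table with an
update() hook, mirrors registered against an mm, and a range notification being
fanned out to every registered mirror.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_mirror;

/* Hypothetical ops table; the driver supplies update(). */
struct sketch_mirror_ops {
    void (*update)(struct sketch_mirror *mirror,
                   unsigned long start, unsigned long end);
};

/* One mirror binds one device to one address space. */
struct sketch_mirror {
    const struct sketch_mirror_ops *ops;
    struct sketch_mirror *next;
    unsigned long invalidated_bytes; /* demo-only driver bookkeeping */
};

/* Stand-in for an mm: just the list of mirrors registered against it. */
struct sketch_mm {
    struct sketch_mirror *mirrors;
};

static void sketch_mirror_register(struct sketch_mm *mm,
                                   struct sketch_mirror *mirror)
{
    mirror->next = mm->mirrors;
    mm->mirrors = mirror;
}

/*
 * Called when the CPU page table changes for [start, end): forward the
 * range to every registered mirror. Returns how many callbacks ran.
 */
static size_t sketch_notify_range(struct sketch_mm *mm,
                                  unsigned long start, unsigned long end)
{
    size_t called = 0;

    assert(start <= end);
    for (struct sketch_mirror *m = mm->mirrors; m; m = m->next) {
        if (m->ops && m->ops->update) {
            m->ops->update(m, start, end);
            called++;
        }
    }
    return called;
}

/* Example driver callback: record how much of its mirror went stale. */
static void sketch_driver_update(struct sketch_mirror *mirror,
                                 unsigned long start, unsigned long end)
{
    mirror->invalidated_bytes += end - start;
}

static const struct sketch_mirror_ops sketch_driver_ops = {
    .update = sketch_driver_update,
};
```

A real driver would invalidate its device page table for the range inside the
callback; the byte counter here only makes the forwarding observable.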

The device driver does not use mmu_notifier directly because HMM provides a way to snapshot
the CPU page table safely, without worrying about concurrent CPU page table updates and without
locking the CPU page table directory.

But all this is separate from migration or devmem, so i doubt it could be useful with
the CAPI bus. hmm_mirror is really just mmu_notifier with some sugar coating :)

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
On Fri, Oct 28, 2016 at 10:59:52AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
>
> > On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >>
> >> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> >> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >> >>
> >> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >> >> >
> >> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> >> >> callback/acceleration during migration is something we should definitely
> >> >> have. I still haven't figured out how non addressable and coherent device
> >> >> memory can fit together there. I was waiting for the page cache
> >> >> migration support to be pushed to the repository before I start looking
> >> >> at this closely.
> >> >>
> >> >
> >> > The page cache migration does not touch the migrate code path. My issue with
> >> > page cache is writeback. The only difference with existing migrate code is
> >> > refcount check for ZONE_DEVICE page. Everything else is the same.
> >>
> >> What about the radix tree ? does file system migrate_page callback handle
> >> replacing normal page with ZONE_DEVICE page/exceptional entries ?
> >>
> >
> > It use the exact same existing code (from mm/migrate.c) so yes the radix tree
> > is updated and buffer_head are migrated.
> >
>
> I looked at the page cache migration patches shared and I find that
> you are not using exceptional entries when we migrate a page cache page to
> device memory. But I am now not sure how a read from page cache will
> work with that.
>
> ie, a file system read will now find the page in page cache. But we
> cannot do a copy_to_user of that page because that is now backed by an
> unaddressable memory right ?
>
> do_generic_file_read() does
> page = find_get_page(mapping, index);
> ....
> ret = copy_page_to_iter(page, offset, nr, iter);
>
> which does
> void *kaddr = kmap_atomic(page);
> size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
> kunmap_atomic(kaddr);

Like i said, right now my patches are mostly broken for un-addressable memory,
for both read and write. I am focusing on page writeback for now as it seemed to
be the more problematic case. For read/write the intention is to trigger a
migration back to system memory inside the read/write path of the filesystem. This is also
why i will need a flag to indicate whether a filesystem supports migration to
un-addressable memory.
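
The stated intention (migrate back to system memory inside the read path
before the CPU touches the data) can be sketched in userspace C. This is only a
toy model under stated assumptions, not filesystem or HMM code: struct
sketch_page, sketch_migrate_to_ram() and sketch_file_read() are hypothetical,
and the "device copy" is just a second buffer standing in for memory the CPU
cannot map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 64

/* Toy page: data lives either on the "device" or in CPU-visible memory. */
struct sketch_page {
    bool unaddressable;                 /* true while the device holds the data */
    char device_copy[SKETCH_PAGE_SIZE]; /* stand-in for device memory */
    char cpu_copy[SKETCH_PAGE_SIZE];    /* valid only when !unaddressable */
};

/* Stand-in for a migration from device memory back to system RAM. */
static void sketch_migrate_to_ram(struct sketch_page *page)
{
    memcpy(page->cpu_copy, page->device_copy, SKETCH_PAGE_SIZE);
    page->unaddressable = false;
}

/*
 * Read path: if the page is not CPU addressable, migrate it back first,
 * then copy out. Returns the number of bytes copied, or -1 if the
 * requested range does not fit inside the page.
 */
static long sketch_file_read(struct sketch_page *page, size_t offset,
                             char *dst, size_t len)
{
    if (offset > SKETCH_PAGE_SIZE || len > SKETCH_PAGE_SIZE - offset)
        return -1;
    if (page->unaddressable)
        sketch_migrate_to_ram(page);
    memcpy(dst, page->cpu_copy + offset, len);
    return (long)len;
}
```

In this model a second read of the same page finds it already addressable and
skips the migration, which is the behaviour the per-filesystem flag would have
to opt in to.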

But in your case, where the device memory is accessible, it should just work,
or do you need to do something special when kmapping a device page ?

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
On 10/25/2016 09:56 AM, Aneesh Kumar K.V wrote:
> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> callback/acceleration during migration is something we should definitely
> have. I still haven't figured out how non addressable and coherent device
> memory can fit together there. I was waiting for the page cache
> migration support to be pushed to the repository before I start looking
> at this closely.

Aneesh, did not get that. Currently basic page cache migration is supported,
right ? The device callback during migration, fault etc are supported through
page->pgmap pointer and extending dev_pagemap structure to accommodate new
members. IIUC that is the reason ZONE_DEVICE is being modified so that page
->pgmap overloading can be used for various driver/device specific callbacks
while inside core VM functions or HMM functions.

HMM V13 has introduced non-addressable ZONE_DEVICE based device memory which
can have its struct pages in system RAM but they cannot be accessed from the
CPU. Now coherent device memory is kind of similar to persistent memory like
NVDIMM which is already supported through ZONE_DEVICE (though we might not
want to use vmem_altmap and instead have the struct pages in system RAM).
Now HMM has to learn to work with 'dev_pagemap->addressable' type of device
memory and then support all possible migrations through its API. So in a
nutshell, these are the changes we need to make for HMM to work with coherent
device memory.

(0) Support all possible migrations between system RAM and device memory
for current un-addressable device memory and make the HMM migration
API layer comprehensive and complete.

(1) Create coherent device memory representation in ZONE_DEVICE
(a) Make it exactly the same as that of persistent memory/NVDIMM

or

(b) Create a new type for coherent device memory representation

(2) Support all possible migrations between system RAM and device memory
for new addressable coherent device memory represented in ZONE_DEVICE
extending the HMM migration API layer.

Right now, HMM V13 patch series supports migration for a subset of private
anonymous pages for un-addressable device memory. I am wondering how difficult
is it to implement all possible anon, file mapping migration support for both
un-addressable and addressable coherent device memory through ZONE_DEVICE.
Re: [RFC 0/8] Define coherent device memory node
On Sat, Nov 05, 2016 at 10:51:21AM +0530, Anshuman Khandual wrote:
> On 10/25/2016 09:56 AM, Aneesh Kumar K.V wrote:
> > I looked at the hmm-v13 w.r.t migration and I guess some form of device
> > callback/acceleration during migration is something we should definitely
> > have. I still haven't figured out how non addressable and coherent device
> > memory can fit together there. I was waiting for the page cache
> > migration support to be pushed to the repository before I start looking
> > at this closely.
>
> Aneesh, did not get that. Currently basic page cache migration is supported,
> right ? The device callback during migration, fault etc are supported through
> page->pgmap pointer and extending dev_pagemap structure to accommodate new
> members. IIUC that is the reason ZONE_DEVICE is being modified so that page
> ->pgmap overloading can be used for various driver/device specific callbacks
> while inside core VM functions or HMM functions.
>
> HMM V13 has introduced non-addressable ZONE_DEVICE based device memory which
> can have its struct pages in system RAM but they cannot be accessed from the
> CPU. Now coherent device memory is kind of similar to persistent memory like
> NVDIMM which is already supported through ZONE_DEVICE (though we might not
> want to use vmem_altmap and instead have the struct pages in system RAM).
> Now HMM has to learn to work with 'dev_pagemap->addressable' type of device
> memory and then support all possible migrations through its API. So in a
> nutshell, these are the changes we need to make for HMM to work with coherent
> device memory.
>
> (0) Support all possible migrations between system RAM and device memory
> for current un-addressable device memory and make the HMM migration
> API layer comprehensive and complete.

What is not comprehensive or complete in the API layer ? I think the API is
pretty clear; the migrate function does not rely on anything except HMM pfn.


>
> (1) Create coherent device memory representation in ZONE_DEVICE
> (a) Make it exactly the same as that of persistent memory/NVDIMM
>
> or
>
> (b) Create a new type for coherent device memory representation

So i will soon push an updated tree with modifications to the HMM API (from the device
driver point of view, but the migrate stuff is virtually the same). I split
the addressable and movable concepts and thus it is now easy to support both coherent
addressable memory and non addressable memory.

>
> (2) Support all possible migrations between system RAM and device memory
> for new addressable coherent device memory represented in ZONE_DEVICE
> extending the HMM migration API layer.
>
> Right now, HMM V13 patch series supports migration for a subset of private
> anonymous pages for un-addressable device memory. I am wondering how difficult
> is it to implement all possible anon, file mapping migration support for both
> un-addressable and addressable coherent device memory through ZONE_DEVICE.
>

There is no need to extend the API to support file-backed pages. As a matter of fact the
2 patches i sent you do support migration of file-backed pages (page->mapping)
to and from ZONE_DEVICE as long as this ZONE_DEVICE memory is accessible by
the CPU and coherent. What i am still working on is the non addressable case,
which is way more tedious (handle direct IO, read, write and writeback).

So the difficulty for coherent memory is nil; it is the non addressable memory that
is hard to support with respect to file-backed pages.

Cheers,
Jérôme
