Mailing List Archive

[RFC 0/8] Define coherent device memory node
There are certain devices like accelerators, GPU cards, network
cards, FPGA cards, PLD cards etc. which might contain on-board memory. This
on-board memory can be kept coherent with system RAM and may be accessible
from either the CPU or the device. The coherency is usually achieved by
synchronizing the cache accesses from either side. This makes the device
memory appear in the same address space as system RAM. The on-board device
memory and system RAM are coherent but differ in their properties, as
explained and elaborated below. The following diagram shows how the
coherent device memory appears in the memory address space.

+-----------------+ +-----------------+
| | | |
| CPU | | DEVICE |
| | | |
+-----------------+ +-----------------+
| |
| Shared Address Space |
+---------------------------------------------------------------------+
| | |
| | |
| System RAM | Coherent Memory |
| | |
| | |
+---------------------------------------------------------------------+

User space applications might be interested in using the coherent
device memory, either explicitly or implicitly, along with system RAM,
utilizing the basic semantics for memory allocation, access and release.
Basically, user applications should be able to allocate memory anywhere
(system RAM or coherent memory) and have it accessed either from the CPU
or from the coherent device for various computation or data transformation
purposes. User space should not have to be concerned with memory placement
or with where the pages are actually allocated when the memory is faulted
in on access.

To achieve seamless integration between system RAM and coherent
device memory, the coherent device memory must be able to utilize core
kernel memory features like anon mapping, file mapping, page cache, driver
managed pages, HW poisoning, migration, reclaim, compaction, etc. Making
the coherent device memory appear as a distinct memory-only NUMA node,
initialized like any other node with memory, can create this integration
with the currently available system RAM. At the same time there should be
a distinguishing mark which indicates that this node is a coherent device
memory node and not just another memory-only system RAM node.

Coherent device memory invariably isn't available until the driver
for the device has been initialized. It is desirable but not required for
the device to support memory offlining for purposes such as power
management, link management and hardware errors. Kernel allocations should
not land in this memory, as they cannot be moved out. Hence coherent device
memory should go into the ZONE_MOVABLE zone instead. This guarantees that
kernel allocations will never be satisfied from this memory, and any
process holding unmovable pages on the coherent device memory (likely
through pinning after the initial allocation) can be killed to free up the
memory and eventually allow the node to be hot plugged out.
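
A minimal sketch of how a driver might bring such memory online as a
memory-only node is shown below; the helper name, node id and address range
are illustrative assumptions, and onlining the sections into ZONE_MOVABLE is
assumed to happen afterwards through the normal memory hotplug onlining path
(for example "online_movable" via sysfs).

/*
 * Illustrative sketch only: a driver hot adds its coherent on-board memory
 * as a memory-only NUMA node once the device has been initialized.  The
 * node id, base address and size are hypothetical; onlining the range into
 * ZONE_MOVABLE is left to the usual memory hotplug onlining path.
 */
#include <linux/memory_hotplug.h>
#include <linux/printk.h>

static int cdm_add_device_memory(int nid, u64 base, u64 size)
{
        int ret;

        /* Registers the range and creates struct pages for its PFNs */
        ret = add_memory(nid, base, size);
        if (ret)
                pr_err("cdm: failed to hot add device memory: %d\n", ret);
        return ret;
}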

Even when represented as a NUMA node, the coherent memory might
still need some special consideration inside the kernel. There can be a
variety of coherent device memory nodes with different expectations of and
special considerations from the core kernel. This RFC discusses only one
such scenario, where the coherent device memory requires just isolation.

Now let us consider in detail the case of a coherent device memory
node which requires isolation. This kind of coherent device memory is on
board an external device attached to the system through a link, where link
errors can take the entire memory node away along with it. Moreover, the
memory might also have a higher chance of ECC errors compared to system
RAM. These are just some possibilities, but the fact remains that coherent
device memory can have other properties which might be undesirable for
some user space applications. An application should not be exposed to the
risks associated with a device if it is not taking advantage of the special
features of that device and its memory.

For the reasons explained above, allocations into an isolation
based coherent device memory node should be further regulated, beyond the
earlier requirement that kernel allocations never land there. User space
allocations should not end up there implicitly, without the application
explicitly knowing about it. This summarizes the isolation requirement of
one particular kind of coherent device memory node as an example.
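
As an illustration of what such an explicit opt-in could look like from user
space, the sketch below binds an anonymous mapping to the coherent device
memory node before touching it. The node id and helper name are made-up
examples, not part of this series.

/*
 * Illustrative sketch: a task that wants its buffer placed on the coherent
 * device memory node opts in explicitly with mbind(MPOL_BIND).  Build with
 * -lnuma for the mbind() wrapper; the node id passed in is hypothetical.
 */
#include <numaif.h>
#include <sys/mman.h>
#include <stddef.h>

static void *alloc_on_cdm_node(size_t len, int cdm_node)
{
        unsigned long nodemask = 1UL << cdm_node;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return NULL;

        /* Pages will now fault into the requested node on first touch */
        if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
                munmap(buf, len);
                return NULL;
        }
        return buf;
}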

Some coherent memory devices may not require isolation at all,
while other coherent memory devices may require some other special
treatment after becoming part of the core memory representation in the
kernel. Though the framework suggested by this RFC makes provisions for
them, it does not consider any requirement other than isolation for now.

Though this RFC series currently implements only one such
isolation seeking coherent device memory example, the framework can be
extended to accommodate any present or future coherent memory devices
which fit the requirements explained above, even with new requirements
other than isolation. In the case of an isolation seeking coherent device
memory node, there are other core VM code paths which still need to be
taken care of before it can be completely isolated as required.

Core kernel memory features like reclamation, eviction, etc. might
need to be restricted or modified on the coherent device memory node as
they can be performance limiting. The RFC does not propose anything on this
yet, but it can be looked into later on. For now it just disables Auto NUMA
for any VMA which has coherent device memory.

Seamless integration of coherent device memory with system memory
will enable various other features, some of which are listed below.

a. Seamless migration between system RAM and the coherent memory
(sketched below)
b. Asynchronous and high throughput migrations
c. Ability to allocate huge order pages from these memory regions
d. Restricting allocations, to a large extent, to the tasks using the
device for workload acceleration
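
A user space sketch of the explicit migration in (a), using the existing
move_pages() system call, follows; the coherent device node id is again a
hypothetical example and the helper name is made up.

/*
 * Illustrative sketch: explicitly migrate one already-touched page of the
 * calling process to the coherent device memory node.  move_pages() is the
 * existing system call exposed through libnuma; the node id is hypothetical.
 */
#include <numaif.h>

static int migrate_page_to_cdm(void *addr, int cdm_node)
{
        void *pages[1] = { addr };
        int nodes[1] = { cdm_node };
        int status[1] = { 0 };

        if (move_pages(0 /* current process */, 1, pages, nodes, status,
                       MPOL_MF_MOVE))
                return -1;

        /* status[0] is the node the page now resides on, or a -errno value */
        return status[0];
}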

Before concluding, let us look into the reasons why the existing
solutions don't work. There are two basic requirements which have to be
satisfied before the coherent device memory can be integrated with the
core kernel seamlessly.

a. Each PFN must have a struct page
b. The struct page must be able to be put on the standard LRU lists

The above two basic requirements rule out the existing device
memory representation approaches listed below, which in turn creates the
need for a new framework.

(1) Traditional ioremap

a. Memory is mapped into kernel (linear and virtual) and user space
b. These PFNs do not have struct pages associated with them
c. These special PFNs are marked with special flags inside the PTE
d. Cannot participate much in core VM functions because of this
e. Cannot do easy user space migrations

(2) Zone ZONE_DEVICE

a. Memory is mapped into kernel and user space
b. PFNs do have struct pages associated with them
c. These struct pages are allocated inside the device's own memory range
d. Unfortunately the struct page's union containing the LRU has been
used for the struct dev_pagemap pointer (see the simplified excerpt
after this list)
e. Hence it cannot be part of any LRU (like the page cache)
f. Hence file cached mappings cannot reside on these PFNs
g. Cannot do easy migrations
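
For reference, the conflict in (d) comes from the way struct page overlays
these fields; below is a simplified excerpt (not the full definition) of the
relevant union as of the v4.8 time frame.

/*
 * Simplified excerpt of struct page, not the full definition: the same
 * union slot holds either the LRU list head or the dev_pagemap pointer,
 * so a ZONE_DEVICE page can never sit on an LRU list.
 */
struct page {
        /* ... flags, mapping, index, counters ... */
        union {
                struct list_head lru;           /* pageout list, e.g. active/inactive */
                struct dev_pagemap *pgmap;      /* ZONE_DEVICE pages are never on an LRU */
                /* ... other users of this slot ... */
        };
        /* ... */
};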

I had also explored a non-LRU representation of this coherent
device memory, where the integration with system RAM in the core VM is
limited to the following functions. Not being on the LRU definitely
reduces the scope for tight integration with system RAM.

(1) Migration support between system RAM and coherent memory
(2) Migration support between various coherent memory nodes
(3) Isolation of the coherent memory
(4) Mapping the coherent memory into user space through driver's
struct vm_operations
(5) HW poisoning of the coherent memory

Allocating the entire memory of the coherent device node right
after hot plug into ZONE_MOVABLE (where the memory is already inside the
buddy system) will still expose a time window where other user space
allocations can come into the coherent device memory node and prevent the
intended isolation. So traditional hot plug is not the solution. Hence I
started looking into a CMA based non-LRU solution, but then hit the
following roadblocks.

(1) CMA does not support hot plugging of a new memory node
a. The CMA area needs to be marked during boot, before the buddy
allocator is initialized
b. cma_alloc()/cma_release() can then happen on the marked area
(the interface is sketched after this list)
c. We should be able to mark CMA areas just after memory hot plug
d. so that cma_alloc()/cma_release() can happen later, after the hot plug
e. This is not supported right now

(2) Mapped non-LRU migration of pages
a. Recent work from Minchan Kim makes non-LRU pages migratable
b. But it still does not support migration of mapped non-LRU pages
c. With a non-LRU CMA reservation, again there are some additional
challenges
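
For reference, a driver-side sketch of the cma_alloc()/cma_release() usage
mentioned in (1) is shown below, assuming a CMA area that was declared at
boot. The helper names and the "cdm_cma" area are hypothetical; the
hot-pluggable variant described above does not exist today.

/*
 * Illustrative sketch only: allocating and releasing a contiguous range
 * from a CMA area reserved at boot time.  Today there is no way to mark
 * such an area after memory hot plug, which is the roadblock above.
 */
#include <linux/cma.h>

static struct page *cdm_cma_get(struct cma *cdm_cma, unsigned long nr_pages)
{
        /* The third argument is the alignment, expressed as a page order */
        return cma_alloc(cdm_cma, nr_pages, 0);
}

static void cdm_cma_put(struct cma *cdm_cma, struct page *page,
                        unsigned long nr_pages)
{
        cma_release(cdm_cma, page, nr_pages);
}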

With hot pluggable CMA and mapped non-LRU migration support there
may be an alternate approach to representing coherent device memory.
Please do review this RFC proposal and let me know your comments or
suggestions. Thank you.

Anshuman Khandual (8):
mm: Define coherent device memory node
mm: Add specialized fallback zonelist for coherent device memory nodes
mm: Isolate coherent device memory nodes from HugeTLB allocation paths
mm: Accommodate coherent device memory nodes in MPOL_BIND implementation
mm: Add new flag VM_CDM for coherent device memory
mm: Make VM_CDM marked VMAs non migratable
mm: Add a new migration function migrate_virtual_range()
mm: Add N_COHERENT_DEVICE node type into node_states[]

Documentation/ABI/stable/sysfs-devices-node | 7 +++
drivers/base/node.c | 6 +++
include/linux/mempolicy.h | 24 +++++++++
include/linux/migrate.h | 3 ++
include/linux/mm.h | 5 ++
include/linux/mmzone.h | 29 ++++++++++
include/linux/nodemask.h | 3 ++
mm/Kconfig | 13 +++++
mm/hugetlb.c | 38 ++++++++++++-
mm/memory_hotplug.c | 10 ++++
mm/mempolicy.c | 70 ++++++++++++++++++++++--
mm/migrate.c | 84 +++++++++++++++++++++++++++++
mm/page_alloc.c | 10 ++++
13 files changed, 295 insertions(+), 7 deletions(-)

--
2.1.0
Re: [RFC 0/8] Define coherent device memory node
On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:

> [...]

> Core kernel memory features like reclamation, evictions etc. might
> need to be restricted or modified on the coherent device memory node as
> they can be performance limiting. The RFC does not propose anything on this
> yet but it can be looked into later on. For now it just disables Auto NUMA
> for any VMA which has coherent device memory.
>
> Seamless integration of coherent device memory with system memory
> will enable various other features, some of which can be listed as follows.
>
> a. Seamless migrations between system RAM and the coherent memory
> b. Will have asynchronous and high throughput migrations
> c. Be able to allocate huge order pages from these memory regions
> d. Restrict allocations to a large extent to the tasks using the
> device for workload acceleration
>
> Before concluding, will look into the reasons why the existing
> solutions don't work. There are two basic requirements which have to be
> satisfies before the coherent device memory can be integrated with core
> kernel seamlessly.
>
> a. PFN must have struct page
> b. Struct page must able to be inside standard LRU lists
>
> The above two basic requirements discard the existing method of
> device memory representation approaches like these which then requires the
> need of creating a new framework.

I do not believe the LRU list is a hard requirement; yes, when faulting in
a page into the page cache it is assumed it needs to be added to an LRU list,
but I think this can easily be worked around.

In HMM I am using ZONE_DEVICE, and because the memory is not accessible from
the CPU (not everyone is blessed with a decent system bus like CAPI, CCIX,
Gen-Z, ...), in my case a file backed page must always be spawned first from
a regular page, and once it has been read from disk I can migrate it to a GPU
page.

So if you accept this intermediary step you can easily use ZONE_DEVICE for
device memory. This way there is no LRU and no complex dance to keep the
memory out of reach of the regular memory allocator.

I think we would have much to gain if we pool our efforts on a single common
solution for device memory. In my case the device memory is not accessible
by the CPU (because of PCIe restrictions), in your case it is. Thus the only
difference is that in my case it cannot be mapped inside the CPU page table
while in yours it can.

>
> (1) Traditional ioremap
>
> a. Memory is mapped into kernel (linear and virtual) and user space
> b. These PFNs do not have struct pages associated with it
> c. These special PFNs are marked with special flags inside the PTE
> d. Cannot participate in core VM functions much because of this
> e. Cannot do easy user space migrations
>
> (2) Zone ZONE_DEVICE
>
> a. Memory is mapped into kernel and user space
> b. PFNs do have struct pages associated with it
> c. These struct pages are allocated inside it's own memory range
> d. Unfortunately the struct page's union containing LRU has been
> used for struct dev_pagemap pointer
> e. Hence it cannot be part of any LRU (like Page cache)
> f. Hence file cached mapping cannot reside on these PFNs
> g. Cannot do easy migrations
>
> I had also explored non LRU representation of this coherent device
> memory where the integration with system RAM in the core VM is limited only
> to the following functions. Not being inside LRU is definitely going to
> reduce the scope of tight integration with system RAM.
>
> (1) Migration support between system RAM and coherent memory
> (2) Migration support between various coherent memory nodes
> (3) Isolation of the coherent memory
> (4) Mapping the coherent memory into user space through driver's
> struct vm_operations
> (5) HW poisoning of the coherent memory
>
> Allocating the entire memory of the coherent device node right
> after hot plug into ZONE_MOVABLE (where the memory is already inside the
> buddy system) will still expose a time window where other user space
> allocations can come into the coherent device memory node and prevent the
> intended isolation. So traditional hot plug is not the solution. Hence
> started looking into CMA based non LRU solution but then hit the following
> roadblocks.
>
> (1) CMA does not support hot plugging of new memory node
> a. CMA area needs to be marked during boot before buddy is
> initialized
> b. cma_alloc()/cma_release() can happen on the marked area
> c. Should be able to mark the CMA areas just after memory hot plug
> d. cma_alloc()/cma_release() can happen later after the hot plug
> e. This is not currently supported right now
>
> (2) Mapped non LRU migration of pages
> a. Recent work from Michan Kim makes non LRU page migratable
> b. But it still does not support migration of mapped non LRU pages
> c. With non LRU CMA reserved, again there are some additional
> challenges
>
> With hot pluggable CMA and non LRU mapped migration support there
> may be an alternate approach to represent coherent device memory. Please
> do review this RFC proposal and let me know your comments or suggestions.
> Thank you.

You can take a look at hmm-v13 if you want to see how I do non-LRU page
migration. While I put most of the migration code inside hmm_migrate.c, it
could easily be moved to migrate.c without the hmm_ prefix.

There are 2 missing pieces in the existing migrate code. The first is to put
memory allocation for the destination under the control of whoever calls the
migrate code. The second is to allow offloading the copy operation to the
device (i.e. not using the CPU to copy data).

I believe the same requirements also make sense for the platform you are
targeting, thus the same code can be used.
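
A minimal sketch of the first point, using the new_page_t callback that the
in-kernel migrate_pages() interface already accepts, is shown below; the
coherent device node id and the helper name are assumptions, not code from
any posted series.

/*
 * Illustrative sketch: migrate_pages() in the kernel already takes a
 * new_page_t callback, so the caller can direct destination allocation,
 * for example toward a coherent device node.  Node id and helper name
 * are hypothetical.
 */
#include <linux/migrate.h>
#include <linux/gfp.h>

static struct page *cdm_alloc_dst_page(struct page *page,
                                       unsigned long private, int **result)
{
        int cdm_nid = (int)private;     /* hypothetical coherent device node */

        return alloc_pages_node(cdm_nid, GFP_HIGHUSER_MOVABLE, 0);
}

/*
 * Typical call site, with "pagelist" already isolated by the caller:
 *
 *      migrate_pages(&pagelist, cdm_alloc_dst_page, NULL,
 *                    cdm_nid, MIGRATE_SYNC, MR_SYSCALL);
 */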

hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13

I haven't posted this patchset yet because we are doing some modifications
to the device driver API to accommodate some new features. But the ZONE_DEVICE
changes and the overall migration code will stay more or less the same (I have
patches that move it to migrate.c and share more code with the existing
migrate code).

If you think I missed anything about the LRU and page cache please point it
out to me, because when I audited the code for that I didn't see any roadblock
with the few filesystems I was looking at (ext4, xfs and the core page cache
code).

> [...]

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
On 10/23/2016 09:31 PM, Anshuman Khandual wrote:
> To achieve seamless integration between system RAM and coherent
> device memory it must be able to utilize core memory kernel features like
> anon mapping, file mapping, page cache, driver managed pages, HW poisoning,
> migrations, reclaim, compaction, etc.

So, you need to support all these things, but not autonuma or hugetlbfs?
What's the reasoning behind that?

If you *really* don't want a "cdm" page to be migrated, then why isn't
that policy set on the VMA in the first place? That would keep "cdm"
pages from being made non-cdm. And, why would autonuma ever make a
non-cdm page and migrate it in to cdm? There will be no NUMA access
faults caused by the devices that are fed to autonuma.

I'm confused.
Re: [RFC 0/8] Define coherent device memory node
On 10/24/2016 01:04 PM, Dave Hansen wrote:

> On 10/23/2016 09:31 PM, Anshuman Khandual wrote:
>> To achieve seamless integration between system RAM and coherent
>> device memory it must be able to utilize core memory kernel features like
>> anon mapping, file mapping, page cache, driver managed pages, HW poisoning,
>> migrations, reclaim, compaction, etc.
> So, you need to support all these things, but not autonuma or hugetlbfs?
> What's the reasoning behind that?
>
> If you *really* don't want a "cdm" page to be migrated, then why isn't
> that policy set on the VMA in the first place? That would keep "cdm"
> pages from being made non-cdm. And, why would autonuma ever make a
> non-cdm page and migrate it in to cdm? There will be no NUMA access
> faults caused by the devices that are fed to autonuma.
>
Pages are desired to be migratable, both into (starting cpu zone
movable->cdm) and out of (starting cdm->cpu zone movable), but only
through explicit migration, not via autonuma. Other pages in the same
VMA should still be migratable between CPU nodes via autonuma, however.

It's expected a lot of these allocations are going to end up in THPs.
I'm not sure we need to explicitly disallow hugetlbfs support, but the
identified use case is definitely via THPs, not hugetlbfs.
Re: [RFC 0/8] Define coherent device memory node
On 10/24/2016 11:32 AM, David Nellans wrote:
> On 10/24/2016 01:04 PM, Dave Hansen wrote:
>> If you *really* don't want a "cdm" page to be migrated, then why isn't
>> that policy set on the VMA in the first place? That would keep "cdm"
>> pages from being made non-cdm. And, why would autonuma ever make a
>> non-cdm page and migrate it in to cdm? There will be no NUMA access
>> faults caused by the devices that are fed to autonuma.
>>
> Pages are desired to be migrateable, both into (starting cpu zone
> movable->cdm) and out of (starting cdm->cpu zone movable) but only
> through explicit migration, not via autonuma.

OK, and is there a reason that the existing mbind code plus NUMA
policies fails to give you this behavior?

Does autonuma somehow override strict NUMA binding?

> other pages in the same
> VMA should still be migrateable between CPU nodes via autonuma however.

That's not the way the implementation here works, as I understand it.
See the VM_CDM patch and my responses to it.

> Its expected a lot of these allocations are going to end up in THPs.
> I'm not sure we need to explicitly disallow hugetlbfs support but the
> identified use case is definitely via THPs not tlbfs.

I think THP and hugetlbfs are implementations, not use cases. :)

Is it too hard to support hugetlbfs that we should complicate its code
to exclude it from this type of memory? Why?
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse <j.glisse@gmail.com> writes:

> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
>> [...]
>
>> Core kernel memory features like reclamation, evictions etc. might
>> need to be restricted or modified on the coherent device memory node as
>> they can be performance limiting. The RFC does not propose anything on this
>> yet but it can be looked into later on. For now it just disables Auto NUMA
>> for any VMA which has coherent device memory.
>>
>> Seamless integration of coherent device memory with system memory
>> will enable various other features, some of which can be listed as follows.
>>
>> a. Seamless migrations between system RAM and the coherent memory
>> b. Will have asynchronous and high throughput migrations
>> c. Be able to allocate huge order pages from these memory regions
>> d. Restrict allocations to a large extent to the tasks using the
>> device for workload acceleration
>>
>> Before concluding, will look into the reasons why the existing
>> solutions don't work. There are two basic requirements which have to be
>> satisfies before the coherent device memory can be integrated with core
>> kernel seamlessly.
>>
>> a. PFN must have struct page
>> b. Struct page must able to be inside standard LRU lists
>>
>> The above two basic requirements discard the existing method of
>> device memory representation approaches like these which then requires the
>> need of creating a new framework.
>
> I do not believe the LRU list is a hard requirement, yes when faulting in
> a page inside the page cache it assumes it needs to be added to lru list.
> But i think this can easily be work around.
>
> In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> so in my case a file back page must always be spawn first from a regular
> page and once read from disk then i can migrate to GPU page.
>
> So if you accept this intermediary step you can easily use ZONE_DEVICE for
> device memory. This way no lru, no complex dance to make the memory out of
> reach from regular memory allocator.

One of the reasons to look at this as a NUMA node is to allow things like
over-commit of coherent device memory. The pages backing CDM being part of
the LRU, and treating the coherent device as a NUMA node, make that really
simple (we can run kswapd for that node).


>
> I think we would have much to gain if we pool our effort on a single common
> solution for device memory. In my case the device memory is not accessible
> by the CPU (because PCIE restrictions), in your case it is. Thus the only
> difference is that in my case it can not be map inside the CPU page table
> while in yours it can.

IMHO, we should be able to share the HMM migration approach. We
definitely won't need the mirror page table part. That is one of the
reasons I requested the HMM mirror page table to be a separate patchset.


>
>>
>> (1) Traditional ioremap
>>
>> a. Memory is mapped into kernel (linear and virtual) and user space
>> b. These PFNs do not have struct pages associated with it
>> c. These special PFNs are marked with special flags inside the PTE
>> d. Cannot participate in core VM functions much because of this
>> e. Cannot do easy user space migrations
>>
>> (2) Zone ZONE_DEVICE
>>
>> a. Memory is mapped into kernel and user space
>> b. PFNs do have struct pages associated with it
>> c. These struct pages are allocated inside it's own memory range
>> d. Unfortunately the struct page's union containing LRU has been
>> used for struct dev_pagemap pointer
>> e. Hence it cannot be part of any LRU (like Page cache)
>> f. Hence file cached mapping cannot reside on these PFNs
>> g. Cannot do easy migrations
>>
>> I had also explored non LRU representation of this coherent device
>> memory where the integration with system RAM in the core VM is limited only
>> to the following functions. Not being inside LRU is definitely going to
>> reduce the scope of tight integration with system RAM.
>>
>> (1) Migration support between system RAM and coherent memory
>> (2) Migration support between various coherent memory nodes
>> (3) Isolation of the coherent memory
>> (4) Mapping the coherent memory into user space through driver's
>> struct vm_operations
>> (5) HW poisoning of the coherent memory
>>
>> Allocating the entire memory of the coherent device node right
>> after hot plug into ZONE_MOVABLE (where the memory is already inside the
>> buddy system) will still expose a time window where other user space
>> allocations can come into the coherent device memory node and prevent the
>> intended isolation. So traditional hot plug is not the solution. Hence
>> started looking into CMA based non LRU solution but then hit the following
>> roadblocks.
>>
>> (1) CMA does not support hot plugging of new memory node
>> a. CMA area needs to be marked during boot before buddy is
>> initialized
>> b. cma_alloc()/cma_release() can happen on the marked area
>> c. Should be able to mark the CMA areas just after memory hot plug
>> d. cma_alloc()/cma_release() can happen later after the hot plug
>> e. This is not currently supported right now
>>
>> (2) Mapped non LRU migration of pages
>> a. Recent work from Michan Kim makes non LRU page migratable
>> b. But it still does not support migration of mapped non LRU pages
>> c. With non LRU CMA reserved, again there are some additional
>> challenges
>>
>> With hot pluggable CMA and non LRU mapped migration support there
>> may be an alternate approach to represent coherent device memory. Please
>> do review this RFC proposal and let me know your comments or suggestions.
>> Thank you.
>
> You can take a look at hmm-v13 if you want to see how i do non LRU page
> migration. While i put most of the migration code inside hmm_migrate.c it
> could easily be move to migrate.c without hmm_ prefix.
>
> There is 2 missing piece with existing migrate code. First is to put memory
> allocation for destination under control of who call the migrate code. Second
> is to allow offloading the copy operation to device (ie not use the CPU to
> copy data).
>
> I believe same requirement also make sense for platform you are targeting.
> Thus same code can be use.
>
> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>
> I haven't posted this patchset yet because we are doing some modifications
> to the device driver API to accomodate some new features. But the ZONE_DEVICE
> changes and the overall migration code will stay the same more or less (i have
> patches that move it to migrate.c and share more code with existing migrate
> code).
>
> If you think i missed anything about lru and page cache please point it to
> me. Because when i audited code for that i didn't see any road block with
> the few fs i was looking at (ext4, xfs and core page cache code).

I looked at hmm-v13 w.r.t. migration and I guess some form of device
callback/acceleration during migration is something we should definitely
have. I still haven't figured out how non-addressable and coherent device
memory can fit together there. I was waiting for the page cache migration
support to be pushed to the repository before I start looking at this
closely.

-aneesh
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse <j.glisse@gmail.com> writes:

> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
>> [...]
>
>> Core kernel memory features like reclamation, evictions etc. might
>> need to be restricted or modified on the coherent device memory node as
>> they can be performance limiting. The RFC does not propose anything on this
>> yet but it can be looked into later on. For now it just disables Auto NUMA
>> for any VMA which has coherent device memory.
>>
>> Seamless integration of coherent device memory with system memory
>> will enable various other features, some of which can be listed as follows.
>>
>> a. Seamless migrations between system RAM and the coherent memory
>> b. Will have asynchronous and high throughput migrations
>> c. Be able to allocate huge order pages from these memory regions
>> d. Restrict allocations to a large extent to the tasks using the
>> device for workload acceleration
>>
>> Before concluding, will look into the reasons why the existing
>> solutions don't work. There are two basic requirements which have to be
>> satisfies before the coherent device memory can be integrated with core
>> kernel seamlessly.
>>
>> a. PFN must have struct page
>> b. Struct page must able to be inside standard LRU lists
>>
>> The above two basic requirements discard the existing method of
>> device memory representation approaches like these which then requires the
>> need of creating a new framework.
>
> I do not believe the LRU list is a hard requirement, yes when faulting in
> a page inside the page cache it assumes it needs to be added to lru list.
> But i think this can easily be work around.
>
> In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> so in my case a file back page must always be spawn first from a regular
> page and once read from disk then i can migrate to GPU page.
>
> So if you accept this intermediary step you can easily use ZONE_DEVICE for
> device memory. This way no lru, no complex dance to make the memory out of
> reach from regular memory allocator.
>
> I think we would have much to gain if we pool our effort on a single common
> solution for device memory. In my case the device memory is not accessible
> by the CPU (because PCIE restrictions), in your case it is. Thus the only
> difference is that in my case it can not be map inside the CPU page table
> while in yours it can.
>
>>
>> (1) Traditional ioremap
>>
>> a. Memory is mapped into kernel (linear and virtual) and user space
>> b. These PFNs do not have struct pages associated with it
>> c. These special PFNs are marked with special flags inside the PTE
>> d. Cannot participate in core VM functions much because of this
>> e. Cannot do easy user space migrations
>>
>> (2) Zone ZONE_DEVICE
>>
>> a. Memory is mapped into kernel and user space
>> b. PFNs do have struct pages associated with it
>> c. These struct pages are allocated inside it's own memory range
>> d. Unfortunately the struct page's union containing LRU has been
>> used for struct dev_pagemap pointer
>> e. Hence it cannot be part of any LRU (like Page cache)
>> f. Hence file cached mapping cannot reside on these PFNs
>> g. Cannot do easy migrations
>>
>> I had also explored non LRU representation of this coherent device
>> memory where the integration with system RAM in the core VM is limited only
>> to the following functions. Not being inside LRU is definitely going to
>> reduce the scope of tight integration with system RAM.
>>
>> (1) Migration support between system RAM and coherent memory
>> (2) Migration support between various coherent memory nodes
>> (3) Isolation of the coherent memory
>> (4) Mapping the coherent memory into user space through driver's
>> struct vm_operations
>> (5) HW poisoning of the coherent memory
>>
>> Allocating the entire memory of the coherent device node right
>> after hot plug into ZONE_MOVABLE (where the memory is already inside the
>> buddy system) will still expose a time window where other user space
>> allocations can come into the coherent device memory node and prevent the
>> intended isolation. So traditional hot plug is not the solution. Hence
>> started looking into CMA based non LRU solution but then hit the following
>> roadblocks.
>>
>> (1) CMA does not support hot plugging of new memory node
>> a. CMA area needs to be marked during boot before buddy is
>> initialized
>> b. cma_alloc()/cma_release() can happen on the marked area
>> c. Should be able to mark the CMA areas just after memory hot plug
>> d. cma_alloc()/cma_release() can happen later after the hot plug
>> e. This is not currently supported right now
>>
>> (2) Mapped non LRU migration of pages
>> a. Recent work from Michan Kim makes non LRU page migratable
>> b. But it still does not support migration of mapped non LRU pages
>> c. With non LRU CMA reserved, again there are some additional
>> challenges
>>
>> With hot pluggable CMA and non LRU mapped migration support there
>> may be an alternate approach to represent coherent device memory. Please
>> do review this RFC proposal and let me know your comments or suggestions.
>> Thank you.
>
> You can take a look at hmm-v13 if you want to see how i do non LRU page
> migration. While i put most of the migration code inside hmm_migrate.c it
> could easily be move to migrate.c without hmm_ prefix.
>
> There is 2 missing piece with existing migrate code. First is to put memory
> allocation for destination under control of who call the migrate code. Second
> is to allow offloading the copy operation to device (ie not use the CPU to
> copy data).
>
> I believe same requirement also make sense for platform you are targeting.
> Thus same code can be use.
>
> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>
> I haven't posted this patchset yet because we are doing some modifications
> to the device driver API to accomodate some new features. But the ZONE_DEVICE
> changes and the overall migration code will stay the same more or less (i have
> patches that move it to migrate.c and share more code with existing migrate
> code).
>
> If you think i missed anything about lru and page cache please point it to
> me. Because when i audited code for that i didn't see any road block with
> the few fs i was looking at (ext4, xfs and core page cache code).
>

The other restriction around ZONE_DEVICE is that it is not a managed zone.
That prevents any direct allocation from the coherent device by an
application; i.e., we would like to force allocation from the coherent device
using an interface like mbind(MPOL_BIND, ...). Is that possible with
ZONE_DEVICE?

-aneesh
Re: [RFC 0/8] Define coherent device memory node
On 25/10/16 04:09, Jerome Glisse wrote:
> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
>> [...]
>
>> Core kernel memory features like reclamation, evictions etc. might
>> need to be restricted or modified on the coherent device memory node as
>> they can be performance limiting. The RFC does not propose anything on this
>> yet but it can be looked into later on. For now it just disables Auto NUMA
>> for any VMA which has coherent device memory.
>>
>> Seamless integration of coherent device memory with system memory
>> will enable various other features, some of which can be listed as follows.
>>
>> a. Seamless migrations between system RAM and the coherent memory
>> b. Will have asynchronous and high throughput migrations
>> c. Be able to allocate huge order pages from these memory regions
>> d. Restrict allocations to a large extent to the tasks using the
>> device for workload acceleration
>>
>> Before concluding, will look into the reasons why the existing
>> solutions don't work. There are two basic requirements which have to be
>> satisfies before the coherent device memory can be integrated with core
>> kernel seamlessly.
>>
>> a. PFN must have struct page
>> b. Struct page must able to be inside standard LRU lists
>>
>> The above two basic requirements discard the existing method of
>> device memory representation approaches like these which then requires the
>> need of creating a new framework.
>
> I do not believe the LRU list is a hard requirement, yes when faulting in
> a page inside the page cache it assumes it needs to be added to lru list.
> But i think this can easily be work around.
>
> In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> so in my case a file back page must always be spawn first from a regular
> page and once read from disk then i can migrate to GPU page.
>

I've not seen the HMM patchset, but will a read from disk go to ZONE_DEVICE
and then get migrated?

> So if you accept this intermediary step you can easily use ZONE_DEVICE for
> device memory. This way no lru, no complex dance to make the memory out of
> reach from regular memory allocator.
>
> I think we would have much to gain if we pool our effort on a single common
> solution for device memory. In my case the device memory is not accessible
> by the CPU (because PCIE restrictions), in your case it is. Thus the only
> difference is that in my case it can not be map inside the CPU page table
> while in yours it can.
>

I think that's a good idea to pool our efforts while at the same time making
progress.

>>
>> (1) Traditional ioremap
>>
>> a. Memory is mapped into kernel (linear and virtual) and user space
>> b. These PFNs do not have struct pages associated with it
>> c. These special PFNs are marked with special flags inside the PTE
>> d. Cannot participate in core VM functions much because of this
>> e. Cannot do easy user space migrations
>>
>> (2) Zone ZONE_DEVICE
>>
>> a. Memory is mapped into kernel and user space
>> b. PFNs do have struct pages associated with it
>> c. These struct pages are allocated inside it's own memory range
>> d. Unfortunately the struct page's union containing LRU has been
>> used for struct dev_pagemap pointer
>> e. Hence it cannot be part of any LRU (like Page cache)
>> f. Hence file cached mapping cannot reside on these PFNs
>> g. Cannot do easy migrations
>>
>> I had also explored non LRU representation of this coherent device
>> memory where the integration with system RAM in the core VM is limited only
>> to the following functions. Not being inside LRU is definitely going to
>> reduce the scope of tight integration with system RAM.
>>
>> (1) Migration support between system RAM and coherent memory
>> (2) Migration support between various coherent memory nodes
>> (3) Isolation of the coherent memory
>> (4) Mapping the coherent memory into user space through driver's
>> struct vm_operations
>> (5) HW poisoning of the coherent memory
>>
>> Allocating the entire memory of the coherent device node right
>> after hot plug into ZONE_MOVABLE (where the memory is already inside the
>> buddy system) will still expose a time window where other user space
>> allocations can come into the coherent device memory node and prevent the
>> intended isolation. So traditional hot plug is not the solution. Hence
>> started looking into CMA based non LRU solution but then hit the following
>> roadblocks.
>>
>> (1) CMA does not support hot plugging of new memory node
>> a. CMA area needs to be marked during boot before buddy is
>> initialized
>> b. cma_alloc()/cma_release() can happen on the marked area
>> c. Should be able to mark the CMA areas just after memory hot plug
>> d. cma_alloc()/cma_release() can happen later after the hot plug
>> e. This is not currently supported right now
>>
>> (2) Mapped non LRU migration of pages
>> a. Recent work from Michan Kim makes non LRU page migratable
>> b. But it still does not support migration of mapped non LRU pages
>> c. With non LRU CMA reserved, again there are some additional
>> challenges
>>
>> With hot pluggable CMA and non LRU mapped migration support there
>> may be an alternate approach to represent coherent device memory. Please
>> do review this RFC proposal and let me know your comments or suggestions.
>> Thank you.
>
> You can take a look at hmm-v13 if you want to see how i do non LRU page
> migration. While i put most of the migration code inside hmm_migrate.c it
> could easily be move to migrate.c without hmm_ prefix.
>
> There is 2 missing piece with existing migrate code. First is to put memory
> allocation for destination under control of who call the migrate code. Second
> is to allow offloading the copy operation to device (ie not use the CPU to
> copy data).
>
> I believe same requirement also make sense for platform you are targeting.
> Thus same code can be use.
>
> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>

Thanks for the link

> I haven't posted this patchset yet because we are doing some modifications
> to the device driver API to accomodate some new features. But the ZONE_DEVICE
> changes and the overall migration code will stay the same more or less (i have
> patches that move it to migrate.c and share more code with existing migrate
> code).
>
> If you think i missed anything about lru and page cache please point it to
> me. Because when i audited code for that i didn't see any road block with
> the few fs i was looking at (ext4, xfs and core page cache code).
>
>> [...]
>
> Cheers,
> Jérôme
>

Cheers,
Balbir Singh.
Re: [RFC 0/8] Define coherent device memory node
On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
>
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >
> >> [...]
> >
> >> Core kernel memory features like reclamation, evictions etc. might
> >> need to be restricted or modified on the coherent device memory node as
> >> they can be performance limiting. The RFC does not propose anything on this
> >> yet but it can be looked into later on. For now it just disables Auto NUMA
> >> for any VMA which has coherent device memory.
> >>
> >> Seamless integration of coherent device memory with system memory
> >> will enable various other features, some of which can be listed as follows.
> >>
> >> a. Seamless migrations between system RAM and the coherent memory
> >> b. Will have asynchronous and high throughput migrations
> >> c. Be able to allocate huge order pages from these memory regions
> >> d. Restrict allocations to a large extent to the tasks using the
> >> device for workload acceleration
> >>
> >> Before concluding, will look into the reasons why the existing
> >> solutions don't work. There are two basic requirements which have to be
> >> satisfies before the coherent device memory can be integrated with core
> >> kernel seamlessly.
> >>
> >> a. PFN must have struct page
> >> b. Struct page must able to be inside standard LRU lists
> >>
> >> The above two basic requirements discard the existing method of
> >> device memory representation approaches like these which then requires the
> >> need of creating a new framework.
> >
> > I do not believe the LRU list is a hard requirement, yes when faulting in
> > a page inside the page cache it assumes it needs to be added to lru list.
> > But i think this can easily be work around.
> >
> > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> > so in my case a file back page must always be spawn first from a regular
> > page and once read from disk then i can migrate to GPU page.
> >
> > So if you accept this intermediary step you can easily use ZONE_DEVICE for
> > device memory. This way no lru, no complex dance to make the memory out of
> > reach from regular memory allocator.
>
> One of the reason to look at this as a NUMA node is to allow things like
> over-commit of coherent device memory. The pages backing CDM being part of
> lru and considering the coherent device as a numa node makes that really
> simpler (we can run kswapd for that node).

I am not convinced that kswapd is what you want for overcommit; for HMM I
leave overcommit to the device driver and they seem quite happy about
handling that themselves. Only the device driver has enough information on
what is worth evicting or what needs to be evicted.

> > I think we would have much to gain if we pool our effort on a single common
> > solution for device memory. In my case the device memory is not accessible
> > by the CPU (because PCIE restrictions), in your case it is. Thus the only
> > difference is that in my case it can not be map inside the CPU page table
> > while in yours it can.
>
> IMHO, we should be able to share the HMM migration approach. We
> definitely won't need the mirror page table part. That is one of the
> reson I requested HMM mirror page table to be a seperate patchset.

They will need to share one thing, that is hmm_pfn_t, which is a special pfn
type in which I store HMM and migrate specific flags for migration. Because
I cannot use the struct list_head lru of struct page, I have to do migration
using arrays of pfns and I need to keep some flags per page during migration.

So I share the same hmm_pfn_t type between the mirror and migrate code. But
that's pretty small and it can be factored out of HMM; I could also just use
pfn_t and add the flags I need there.
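
Purely as a rough illustration of the kind of flag-carrying pfn type being
described (this is not the actual HMM definition, just a sketch):

/*
 * Illustrative only, not the actual HMM code: a pfn value with a couple of
 * migration-state flags packed into its low bits, so migration can be
 * driven from an array of these instead of the struct page lru list.
 */
typedef unsigned long hmm_pfn_t;

#define HMM_PFN_VALID   (1UL << 0)
#define HMM_PFN_MIGRATE (1UL << 1)
#define HMM_PFN_SHIFT   2

static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
{
        return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
}

static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t hpfn)
{
        return hpfn >> HMM_PFN_SHIFT;
}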

>
> >
> >>
> >> (1) Traditional ioremap
> >>
> >> a. Memory is mapped into kernel (linear and virtual) and user space
> >> b. These PFNs do not have struct pages associated with it
> >> c. These special PFNs are marked with special flags inside the PTE
> >> d. Cannot participate in core VM functions much because of this
> >> e. Cannot do easy user space migrations
> >>
> >> (2) Zone ZONE_DEVICE
> >>
> >> a. Memory is mapped into kernel and user space
> >> b. PFNs do have struct pages associated with it
> >> c. These struct pages are allocated inside it's own memory range
> >> d. Unfortunately the struct page's union containing LRU has been
> >> used for struct dev_pagemap pointer
> >> e. Hence it cannot be part of any LRU (like Page cache)
> >> f. Hence file cached mapping cannot reside on these PFNs
> >> g. Cannot do easy migrations
> >>
> >> I had also explored non LRU representation of this coherent device
> >> memory where the integration with system RAM in the core VM is limited only
> >> to the following functions. Not being inside LRU is definitely going to
> >> reduce the scope of tight integration with system RAM.
> >>
> >> (1) Migration support between system RAM and coherent memory
> >> (2) Migration support between various coherent memory nodes
> >> (3) Isolation of the coherent memory
> >> (4) Mapping the coherent memory into user space through driver's
> >> struct vm_operations
> >> (5) HW poisoning of the coherent memory
> >>
> >> Allocating the entire memory of the coherent device node right
> >> after hot plug into ZONE_MOVABLE (where the memory is already inside the
> >> buddy system) will still expose a time window where other user space
> >> allocations can come into the coherent device memory node and prevent the
> >> intended isolation. So traditional hot plug is not the solution. Hence
> >> started looking into CMA based non LRU solution but then hit the following
> >> roadblocks.
> >>
> >> (1) CMA does not support hot plugging of new memory node
> >> a. CMA area needs to be marked during boot before buddy is
> >> initialized
> >> b. cma_alloc()/cma_release() can happen on the marked area
> >> c. Should be able to mark the CMA areas just after memory hot plug
> >> d. cma_alloc()/cma_release() can happen later after the hot plug
> >> e. This is not currently supported right now
> >>
> >> (2) Mapped non LRU migration of pages
> >> a. Recent work from Michan Kim makes non LRU page migratable
> >> b. But it still does not support migration of mapped non LRU pages
> >> c. With non LRU CMA reserved, again there are some additional
> >> challenges
> >>
> >> With hot pluggable CMA and non LRU mapped migration support there
> >> may be an alternate approach to represent coherent device memory. Please
> >> do review this RFC proposal and let me know your comments or suggestions.
> >> Thank you.
> >
> > You can take a look at hmm-v13 if you want to see how i do non LRU page
> > migration. While i put most of the migration code inside hmm_migrate.c it
> > could easily be move to migrate.c without hmm_ prefix.
> >
> > There is 2 missing piece with existing migrate code. First is to put memory
> > allocation for destination under control of who call the migrate code. Second
> > is to allow offloading the copy operation to device (ie not use the CPU to
> > copy data).
> >
> > I believe same requirement also make sense for platform you are targeting.
> > Thus same code can be use.
> >
> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >
> > I haven't posted this patchset yet because we are doing some modifications
> > to the device driver API to accomodate some new features. But the ZONE_DEVICE
> > changes and the overall migration code will stay the same more or less (i have
> > patches that move it to migrate.c and share more code with existing migrate
> > code).
> >
> > If you think i missed anything about lru and page cache please point it to
> > me. Because when i audited code for that i didn't see any road block with
> > the few fs i was looking at (ext4, xfs and core page cache code).
>
> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> callback/acceleration during migration is something we should definitely
> have. I still haven't figured out how non addressable and coherent device
> memory can fit together there. I was waiting for the page cache
> migration support to be pushed to the repository before I start looking
> at this closely.
>

The page cache migration does not touch the migrate code path. My issue with
the page cache is writeback. The only difference from the existing migrate
code is the refcount check for ZONE_DEVICE pages. Everything else is the same.

For writeback I need to use a bounce page, so basically I am trying to hook
myself into the ISA bounce infrastructure for bio, and I think it is the
easiest path to solve this in my case.

In your case, where the block device can also access the device memory, you
don't even need to use a bounce page for writeback.

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
On Tue, Oct 25, 2016 at 11:07:39PM +1100, Balbir Singh wrote:
> On 25/10/16 04:09, Jerome Glisse wrote:
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >
> >> [...]
> >
> >> Core kernel memory features like reclamation, evictions etc. might
> >> need to be restricted or modified on the coherent device memory node as
> >> they can be performance limiting. The RFC does not propose anything on this
> >> yet but it can be looked into later on. For now it just disables Auto NUMA
> >> for any VMA which has coherent device memory.
> >>
> >> Seamless integration of coherent device memory with system memory
> >> will enable various other features, some of which can be listed as follows.
> >>
> >> a. Seamless migrations between system RAM and the coherent memory
> >> b. Will have asynchronous and high throughput migrations
> >> c. Be able to allocate huge order pages from these memory regions
> >> d. Restrict allocations to a large extent to the tasks using the
> >> device for workload acceleration
> >>
> >> Before concluding, will look into the reasons why the existing
> >> solutions don't work. There are two basic requirements which have to be
> >> satisfies before the coherent device memory can be integrated with core
> >> kernel seamlessly.
> >>
> >> a. PFN must have struct page
> >> b. Struct page must able to be inside standard LRU lists
> >>
> >> The above two basic requirements discard the existing method of
> >> device memory representation approaches like these which then requires the
> >> need of creating a new framework.
> >
> > I do not believe the LRU list is a hard requirement, yes when faulting in
> > a page inside the page cache it assumes it needs to be added to lru list.
> > But i think this can easily be work around.
> >
> > In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> > (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> > so in my case a file back page must always be spawn first from a regular
> > page and once read from disk then i can migrate to GPU page.
> >
>
> I've not seen the HMM patchset, but read from disk will go to ZONE_DEVICE?
> Then get migrated?

Because in my case the device memory is not accessible by anything except the
device (not entirely true, but for the sake of the design it is), any page
read from disk will first be read into a regular page (from regular system
memory). It is only once it is uptodate and in the page cache that it can be
migrated to a ZONE_DEVICE page.

So a read from disk uses an intermediary page. Writeback is kind of the same;
I plan on using a bounce page by leveraging the existing bio bounce
infrastructure.

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:

[...]

> > You can take a look at hmm-v13 if you want to see how i do non LRU page
> > migration. While i put most of the migration code inside hmm_migrate.c it
> > could easily be move to migrate.c without hmm_ prefix.
> >
> > There is 2 missing piece with existing migrate code. First is to put memory
> > allocation for destination under control of who call the migrate code. Second
> > is to allow offloading the copy operation to device (ie not use the CPU to
> > copy data).
> >
> > I believe same requirement also make sense for platform you are targeting.
> > Thus same code can be use.
> >
> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >
> > I haven't posted this patchset yet because we are doing some modifications
> > to the device driver API to accomodate some new features. But the ZONE_DEVICE
> > changes and the overall migration code will stay the same more or less (i have
> > patches that move it to migrate.c and share more code with existing migrate
> > code).
> >
> > If you think i missed anything about lru and page cache please point it to
> > me. Because when i audited code for that i didn't see any road block with
> > the few fs i was looking at (ext4, xfs and core page cache code).
> >
>
> The other restriction around ZONE_DEVICE is, it is not a managed zone.
> That prevents any direct allocation from coherent device by application.
> ie, we would like to force allocation from coherent device using
> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?

To achieve this we rely on the device fault code path, i.e. when the device
takes a page fault, with the help of HMM it will use existing memory, if any,
for the fault address, but if the CPU page table is empty (and it is not a
file backed vma, because of readback) then the device can directly allocate
device memory and HMM will update the CPU page table to point to the newly
allocated device memory.

So in fact I am not using an existing kernel API to achieve this; the whole
policy of where to allocate and what to allocate is under the device driver's
responsibility, and the device driver leverages its existing userspace API to
get proper hints/direction from the application.

Device memory is really a special case in my view. It only makes sense to use
it if the memory is actively accessed by the device, and the only way the
device accesses memory is when it is programmed to do so through the device
driver API. There is no such thing as GPU threads in the kernel and there is
no way to spawn or move a work thread to the GPU. These are specialized
devices and they require a special per device API. So in my view using an
existing kernel API such as mbind() is counter productive. You might have
buggy software that mbinds its memory to the device and never uses the
device, which leads to device memory being wasted on a process that never
uses the device.

So my opinion is that you should not try to use existing kernel APIs to get
policy information from userspace, but let the device driver gather such
policy through its own private API.

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse <j.glisse@gmail.com> writes:

> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse <j.glisse@gmail.com> writes:
>> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
> [...]
>
>> > You can take a look at hmm-v13 if you want to see how i do non LRU page
>> > migration. While i put most of the migration code inside hmm_migrate.c it
>> > could easily be move to migrate.c without hmm_ prefix.
>> >
>> > There is 2 missing piece with existing migrate code. First is to put memory
>> > allocation for destination under control of who call the migrate code. Second
>> > is to allow offloading the copy operation to device (ie not use the CPU to
>> > copy data).
>> >
>> > I believe same requirement also make sense for platform you are targeting.
>> > Thus same code can be use.
>> >
>> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>> >
>> > I haven't posted this patchset yet because we are doing some modifications
>> > to the device driver API to accomodate some new features. But the ZONE_DEVICE
>> > changes and the overall migration code will stay the same more or less (i have
>> > patches that move it to migrate.c and share more code with existing migrate
>> > code).
>> >
>> > If you think i missed anything about lru and page cache please point it to
>> > me. Because when i audited code for that i didn't see any road block with
>> > the few fs i was looking at (ext4, xfs and core page cache code).
>> >
>>
>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>> That prevents any direct allocation from coherent device by application.
>> ie, we would like to force allocation from coherent device using
>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>
> To achieve this we rely on device fault code path ie when device take a page fault
> with help of HMM it will use existing memory if any for fault address but if CPU
> page table is empty (and it is not file back vma because of readback) then device
> can directly allocate device memory and HMM will update CPU page table to point to
> newly allocated device memory.
>

That is OK if the device touches the page first. What if we want an allocation that is
touched first by the CPU to come from GPU memory? Should we always depend on the GPU
driver to migrate such pages later from system RAM to GPU memory?

-aneesh
Re: [RFC 0/8] Define coherent device memory node
On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
>
> > On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >
> > [...]
> >
> >> > You can take a look at hmm-v13 if you want to see how i do non LRU page
> >> > migration. While i put most of the migration code inside hmm_migrate.c it
> >> > could easily be move to migrate.c without hmm_ prefix.
> >> >
> >> > There is 2 missing piece with existing migrate code. First is to put memory
> >> > allocation for destination under control of who call the migrate code. Second
> >> > is to allow offloading the copy operation to device (ie not use the CPU to
> >> > copy data).
> >> >
> >> > I believe same requirement also make sense for platform you are targeting.
> >> > Thus same code can be use.
> >> >
> >> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >> >
> >> > I haven't posted this patchset yet because we are doing some modifications
> >> > to the device driver API to accomodate some new features. But the ZONE_DEVICE
> >> > changes and the overall migration code will stay the same more or less (i have
> >> > patches that move it to migrate.c and share more code with existing migrate
> >> > code).
> >> >
> >> > If you think i missed anything about lru and page cache please point it to
> >> > me. Because when i audited code for that i didn't see any road block with
> >> > the few fs i was looking at (ext4, xfs and core page cache code).
> >> >
> >>
> >> The other restriction around ZONE_DEVICE is, it is not a managed zone.
> >> That prevents any direct allocation from coherent device by application.
> >> ie, we would like to force allocation from coherent device using
> >> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
> >
> > To achieve this we rely on device fault code path ie when device take a page fault
> > with help of HMM it will use existing memory if any for fault address but if CPU
> > page table is empty (and it is not file back vma because of readback) then device
> > can directly allocate device memory and HMM will update CPU page table to point to
> > newly allocated device memory.
> >
>
> That is ok if the device touch the page first. What if we want the
> allocation touched first by cpu to come from GPU ?. Should we always
> depend on GPU driver to migrate such pages later from system RAM to GPU
> memory ?
>

I am not sure what kind of workload would rather have the first CPU access for a range be
served from device memory. So no, my code does not handle that, and it would be pointless
for it to do so, since in my case the CPU cannot access device memory at all.

That said, nothing forbids adding ZONE_DEVICE support to an mbind()-like syscall. Though my
personal preference would still be to avoid using such a generic syscall and have the device
driver set the allocation policy through its own userspace API (the device driver could
reuse the internals of mbind() to achieve the end result).
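
(For reference, the existing interface would be used from userspace roughly as in the
sketch below, assuming purely for illustration that the coherent device memory shows up as
NUMA node 1; mbind() and MPOL_BIND are the real APIs, only the node number is an
assumption.)

#include <numaif.h>     /* mbind(), MPOL_BIND, MPOL_MF_STRICT (link with -lnuma) */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t len = 64 << 20;                   /* 64MB buffer */
    int cdm_node = 1;                        /* assumption: CDM memory is node 1 */
    unsigned long nodemask = 1UL << cdm_node;

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Ask the kernel to satisfy faults on this range only from cdm_node. */
    if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
              MPOL_MF_STRICT) != 0) {
        perror("mbind");
        return 1;
    }

    memset(buf, 0, len);    /* first touch now allocates from the CDM node */
    return 0;
}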

I am not saying that everything you want to do is doable now with HMM, but nothing precludes
achieving what you want to achieve using ZONE_DEVICE. I really don't think any of the
existing mm mechanisms (kswapd, lru, numa, ...) are a nice fit or can be reused with device
memory.

Each device is so different from the others that I don't believe in a one-API-fits-all
approach. The drm GPU subsystem of the kernel is a testimony to how little can be shared
when it comes to GPUs. The only common code is modesetting. Everything that deals with how
to use the GPU for compute is per device, and most of the logic is in userspace. So I do
not see any commonality that could be abstracted at the syscall level. I would rather let
the device driver stack (kernel and userspace) take such decisions and have the higher
level APIs (OpenCL, CUDA, C++17, ...) expose something that makes sense for each of them.
Programmers target those high level APIs, and they intend to use the mechanisms each offers
to manage memory and memory placement. I would say forcing them to use a second
Linux-specific API to achieve the latter is wrong, at least for now.
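
CUDA 8 is a concrete example of what I mean: placement and prefetch hints already flow
through the device stack's own API, with no Linux-specific syscall involved. A minimal
sketch, assuming a single GPU at device 0:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t len = 64 << 20;
    float *buf;

    /* One allocation visible to both CPU and GPU (unified/managed memory). */
    if (cudaMallocManaged((void **)&buf, len, cudaMemAttachGlobal) != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }

    /* Placement policy expressed through the device API, not mbind(). */
    cudaMemAdvise(buf, len, cudaMemAdviseSetPreferredLocation, 0 /* GPU 0 */);
    cudaMemPrefetchAsync(buf, len, 0 /* GPU 0 */, 0 /* default stream */);

    /* ... launch kernels that work on buf ... */

    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}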

So in the end, if the mbind() syscall would be issued by the userspace side of the device
driver anyway, then why not just have the device driver communicate this through its own
kernel API (which can be much more expressive than what a standardized syscall offers)?
I would rather avoid making changes to any syscall for now.

If later, down the road, once the userspace ecosystem stabilizes, we see that there is a
good level at which we can abstract memory policy for enough devices, then and only then
would it make sense to either introduce a new syscall or grow/modify an existing one. Right
now I fear we could only make bad decisions that we would regret down the road.

I think we can achieve device memory support with a minimum amount of changes to mm code
and existing mm mechanisms. Using ZONE_DEVICE already makes sure that such memory is kept
out of most mm mechanisms and hence avoids all the changes you had to make for the CDM
node. It just looks like a better fit from my point of view. I think it is worth
considering for your use case too. I am sure the folks writing the device driver would
rather share more code between platforms with grown-up bus systems (CAPI, CCIX, ...) and
platforms with kid bus systems (PCIe; let's forget about PCI and ISA :))

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse <j.glisse@gmail.com> writes:

> On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse <j.glisse@gmail.com> writes:
>>
>> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>> >
>> I looked at the hmm-v13 w.r.t migration and I guess some form of device
>> callback/acceleration during migration is something we should definitely
>> have. I still haven't figured out how non addressable and coherent device
>> memory can fit together there. I was waiting for the page cache
>> migration support to be pushed to the repository before I start looking
>> at this closely.
>>
>
> The page cache migration does not touch the migrate code path. My issue with
> page cache is writeback. The only difference with existing migrate code is
> refcount check for ZONE_DEVICE page. Everything else is the same.

What about the radix tree? Does the file system's migrate_page callback handle replacing a
normal page with a ZONE_DEVICE page/exceptional entry?

>
> For writeback i need to use a bounce page so basicly i am trying to hook myself
> along the ISA bounce infrastructure for bio and i think it is the easiest path
> to solve this in my case.
>
> In your case where block device can also access the device memory you don't
> even need to use bounce page for writeback.
>

-aneesh
Re: [RFC 0/8] Define coherent device memory node
On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse <j.glisse@gmail.com> writes:
>>
>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>>>
>>> [...]
>>>
>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page
>>>>> migration. While i put most of the migration code inside hmm_migrate.c it
>>>>> could easily be move to migrate.c without hmm_ prefix.
>>>>>
>>>>> There is 2 missing piece with existing migrate code. First is to put memory
>>>>> allocation for destination under control of who call the migrate code. Second
>>>>> is to allow offloading the copy operation to device (ie not use the CPU to
>>>>> copy data).
>>>>>
>>>>> I believe same requirement also make sense for platform you are targeting.
>>>>> Thus same code can be use.
>>>>>
>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>>>>>
>>>>> I haven't posted this patchset yet because we are doing some modifications
>>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE
>>>>> changes and the overall migration code will stay the same more or less (i have
>>>>> patches that move it to migrate.c and share more code with existing migrate
>>>>> code).
>>>>>
>>>>> If you think i missed anything about lru and page cache please point it to
>>>>> me. Because when i audited code for that i didn't see any road block with
>>>>> the few fs i was looking at (ext4, xfs and core page cache code).
>>>>>
>>>>
>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>>>> That prevents any direct allocation from coherent device by application.
>>>> ie, we would like to force allocation from coherent device using
>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>>>
>>> To achieve this we rely on device fault code path ie when device take a page fault
>>> with help of HMM it will use existing memory if any for fault address but if CPU
>>> page table is empty (and it is not file back vma because of readback) then device
>>> can directly allocate device memory and HMM will update CPU page table to point to
>>> newly allocated device memory.
>>>
>>
>> That is ok if the device touch the page first. What if we want the
>> allocation touched first by cpu to come from GPU ?. Should we always
>> depend on GPU driver to migrate such pages later from system RAM to GPU
>> memory ?
>>
>
> I am not sure what kind of workload would rather have every first CPU access for
> a range to use device memory. So no my code does not handle that and it is pointless
> for it as CPU can not access device memory for me.
>
> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall.
> Thought my personnal preference would still be to avoid use of such generic syscall
> but have device driver set allocation policy through its own userspace API (device
> driver could reuse internal of mbind() to achieve the end result).
>
> I am not saying that eveything you want to do is doable now with HMM but, nothing
> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think
> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse
> with device memory.
>
> Each device is so different from the other that i don't believe in a one API fit all.
> The drm GPU subsystem of the kernel is a testimony of how little can be share when it
> comes to GPU. The only common code is modesetting. Everything that deals with how to
> use GPU to compute stuff is per device and most of the logic is in userspace. So i do
> not see any commonality that could be abstracted at syscall level. I would rather let
> device driver stack (kernel and userspace) take such decision and have the higher level
> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them.
> Programmer target those high level API and they intend to use the mechanism each offer
> to manage memory and memory placement. I would say forcing them to use a second linux
> specific API to achieve the latter is wrong, at lest for now.
>
> So in the end if the mbind() syscall is done by the userspace side of the device driver
> then why not just having the device driver communicate this through its own kernel
> API (which can be much more expressive than what standardize syscall offers). I would
> rather avoid making change to any syscall for now.
>
> If latter, down the road, once the userspace ecosystem stabilize, we see that there
> is a good level at which we can abstract memory policy for enough devices then and
> only then it would make sense to either introduce new syscall or grow/modify existing
> one. Right now i fear we could only make bad decision that we would regret down the
> road.
>
> I think we can achieve memory device support with the minimum amount of changes to mm
> code and existing mm mechanism. Using ZONE_DEVICE already make sure that such memory
> is kept out of most mm mechanism and hence avoid all the changes you had to make for
> CDM node. It just looks a better fit from my point of view. I think it is worth
> considering for your use case too. I am sure folks writting the device driver would
> rather share more code between platform with grown up bus system (CAPI, CCIX, ...)
> vs platform with kid bus system (PCIE let's forget about PCI and ISA :))

Because of the coherent access between the CPU and the device, the intention is to use the
same buffer (VMA) interchangeably from the CPU and the device throughout the run time of
the application, depending upon which side is accessing it more and how much performance
benefit the migration will provide. Now, driver managed memory is non-LRU (whether we use
ZONE_DEVICE or not), and we had issues migrating non-LRU pages mapped into user space. I am
not sure whether Minchan has changed the basic non-LRU migration enablement code to support
mapped non-LRU pages well. So in that case, how are we going to migrate back and forth
between system RAM and device memory?
Re: [RFC 0/8] Define coherent device memory node
On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse <j.glisse@gmail.com> writes:
>>
>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>>>
>>> [...]
>>>
>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page
>>>>> migration. While i put most of the migration code inside hmm_migrate.c it
>>>>> could easily be move to migrate.c without hmm_ prefix.
>>>>>
>>>>> There is 2 missing piece with existing migrate code. First is to put memory
>>>>> allocation for destination under control of who call the migrate code. Second
>>>>> is to allow offloading the copy operation to device (ie not use the CPU to
>>>>> copy data).
>>>>>
>>>>> I believe same requirement also make sense for platform you are targeting.
>>>>> Thus same code can be use.
>>>>>
>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>>>>>
>>>>> I haven't posted this patchset yet because we are doing some modifications
>>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE
>>>>> changes and the overall migration code will stay the same more or less (i have
>>>>> patches that move it to migrate.c and share more code with existing migrate
>>>>> code).
>>>>>
>>>>> If you think i missed anything about lru and page cache please point it to
>>>>> me. Because when i audited code for that i didn't see any road block with
>>>>> the few fs i was looking at (ext4, xfs and core page cache code).
>>>>>
>>>>
>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>>>> That prevents any direct allocation from coherent device by application.
>>>> ie, we would like to force allocation from coherent device using
>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>>>
>>> To achieve this we rely on device fault code path ie when device take a page fault
>>> with help of HMM it will use existing memory if any for fault address but if CPU
>>> page table is empty (and it is not file back vma because of readback) then device
>>> can directly allocate device memory and HMM will update CPU page table to point to
>>> newly allocated device memory.
>>>
>>
>> That is ok if the device touch the page first. What if we want the
>> allocation touched first by cpu to come from GPU ?. Should we always
>> depend on GPU driver to migrate such pages later from system RAM to GPU
>> memory ?
>>
>
> I am not sure what kind of workload would rather have every first CPU access for
> a range to use device memory. So no my code does not handle that and it is pointless
> for it as CPU can not access device memory for me.

If the user space application can explicitly allocate device memory directly, we can save
one round of migration when the device starts accessing it. But then one can argue: what
problem would the device work on in freshly allocated memory which has not yet been touched
by the CPU to load the data? Will look into this scenario in more detail.
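
With the CDM-as-NUMA-node design, such an explicit allocation could be a plain libnuma
call, something like the sketch below (node 1 is only an assumption for illustration):

#include <numa.h>       /* numa_available(), numa_alloc_onnode(), numa_free(); -lnuma */
#include <string.h>
#include <stdio.h>

int main(void)
{
    size_t len = 64 << 20;
    int cdm_node = 1;               /* assumption: coherent device memory node */

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    /* Allocate on the device node (libnuma may fall back if the node is full). */
    void *buf = numa_alloc_onnode(len, cdm_node);
    if (!buf)
        return 1;

    memset(buf, 0, len);            /* faults the pages in on cdm_node */
    /* ... hand buf to the device via the driver API ... */

    numa_free(buf, len);
    return 0;
}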

>
> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall.
> Thought my personnal preference would still be to avoid use of such generic syscall
> but have device driver set allocation policy through its own userspace API (device
> driver could reuse internal of mbind() to achieve the end result).

Okay, the basic premise of the CDM node is to have an LRU based design where we can avoid
the use of driver specific user space memory management code altogether.

>
> I am not saying that eveything you want to do is doable now with HMM but, nothing
> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think
> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse
> with device memory.

With the CDM node based design, the expectation is to get all (or most) of the core VM
mechanisms working so that the driver has to do less device specific optimization.

>
> Each device is so different from the other that i don't believe in a one API fit all.

Right, so as I had mentioned in the cover letter, pglist_data->coherent_device can actually
become a bit mask indicating the type of coherent device the node is, and that can be used
to implement multiple types of requirements in core mm for various kinds of devices in the
future.

> The drm GPU subsystem of the kernel is a testimony of how little can be share when it
> comes to GPU. The only common code is modesetting. Everything that deals with how to
> use GPU to compute stuff is per device and most of the logic is in userspace. So i do

What's the basic reason that prevents such code/functionality sharing?

> not see any commonality that could be abstracted at syscall level. I would rather let
> device driver stack (kernel and userspace) take such decision and have the higher level
> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them.
> Programmer target those high level API and they intend to use the mechanism each offer
> to manage memory and memory placement. I would say forcing them to use a second linux
> specific API to achieve the latter is wrong, at lest for now.

But going forward, don't we want a more closely integrated coherent device solution which
does not depend too much on a device driver stack and can be used from a basic user space
program?
Re: [RFC 0/8] Define coherent device memory node
On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >>
> >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> >>>> Jerome Glisse <j.glisse@gmail.com> writes:
> >>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >>>
> >>> [...]
> >>>
> >>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page
> >>>>> migration. While i put most of the migration code inside hmm_migrate.c it
> >>>>> could easily be move to migrate.c without hmm_ prefix.
> >>>>>
> >>>>> There is 2 missing piece with existing migrate code. First is to put memory
> >>>>> allocation for destination under control of who call the migrate code. Second
> >>>>> is to allow offloading the copy operation to device (ie not use the CPU to
> >>>>> copy data).
> >>>>>
> >>>>> I believe same requirement also make sense for platform you are targeting.
> >>>>> Thus same code can be use.
> >>>>>
> >>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >>>>>
> >>>>> I haven't posted this patchset yet because we are doing some modifications
> >>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE
> >>>>> changes and the overall migration code will stay the same more or less (i have
> >>>>> patches that move it to migrate.c and share more code with existing migrate
> >>>>> code).
> >>>>>
> >>>>> If you think i missed anything about lru and page cache please point it to
> >>>>> me. Because when i audited code for that i didn't see any road block with
> >>>>> the few fs i was looking at (ext4, xfs and core page cache code).
> >>>>>
> >>>>
> >>>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
> >>>> That prevents any direct allocation from coherent device by application.
> >>>> ie, we would like to force allocation from coherent device using
> >>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
> >>>
> >>> To achieve this we rely on device fault code path ie when device take a page fault
> >>> with help of HMM it will use existing memory if any for fault address but if CPU
> >>> page table is empty (and it is not file back vma because of readback) then device
> >>> can directly allocate device memory and HMM will update CPU page table to point to
> >>> newly allocated device memory.
> >>>
> >>
> >> That is ok if the device touch the page first. What if we want the
> >> allocation touched first by cpu to come from GPU ?. Should we always
> >> depend on GPU driver to migrate such pages later from system RAM to GPU
> >> memory ?
> >>
> >
> > I am not sure what kind of workload would rather have every first CPU access for
> > a range to use device memory. So no my code does not handle that and it is pointless
> > for it as CPU can not access device memory for me.
> >
> > That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall.
> > Thought my personnal preference would still be to avoid use of such generic syscall
> > but have device driver set allocation policy through its own userspace API (device
> > driver could reuse internal of mbind() to achieve the end result).
> >
> > I am not saying that eveything you want to do is doable now with HMM but, nothing
> > preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think
> > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse
> > with device memory.
> >
> > Each device is so different from the other that i don't believe in a one API fit all.
> > The drm GPU subsystem of the kernel is a testimony of how little can be share when it
> > comes to GPU. The only common code is modesetting. Everything that deals with how to
> > use GPU to compute stuff is per device and most of the logic is in userspace. So i do
> > not see any commonality that could be abstracted at syscall level. I would rather let
> > device driver stack (kernel and userspace) take such decision and have the higher level
> > API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them.
> > Programmer target those high level API and they intend to use the mechanism each offer
> > to manage memory and memory placement. I would say forcing them to use a second linux
> > specific API to achieve the latter is wrong, at lest for now.
> >
> > So in the end if the mbind() syscall is done by the userspace side of the device driver
> > then why not just having the device driver communicate this through its own kernel
> > API (which can be much more expressive than what standardize syscall offers). I would
> > rather avoid making change to any syscall for now.
> >
> > If latter, down the road, once the userspace ecosystem stabilize, we see that there
> > is a good level at which we can abstract memory policy for enough devices then and
> > only then it would make sense to either introduce new syscall or grow/modify existing
> > one. Right now i fear we could only make bad decision that we would regret down the
> > road.
> >
> > I think we can achieve memory device support with the minimum amount of changes to mm
> > code and existing mm mechanism. Using ZONE_DEVICE already make sure that such memory
> > is kept out of most mm mechanism and hence avoid all the changes you had to make for
> > CDM node. It just looks a better fit from my point of view. I think it is worth
> > considering for your use case too. I am sure folks writting the device driver would
> > rather share more code between platform with grown up bus system (CAPI, CCIX, ...)
> > vs platform with kid bus system (PCIE let's forget about PCI and ISA :))
>
> Because of coherent access between the CPU and the device, the intention is to use
> the same buffer (VMA) accessed between CPU and device interchangeably through out
> the run time of the application depending upon which side is accessing more and
> how much of performance benefit it will provide after the migration. Now driver
> managed memory is non LRU (whether we use ZONE_DEVICE or not) and we had issues
> migrating non LRU pages mapped in user space. I am not sure whether Minchan had
> changed the basic non LRU migration enablement code to support mapped non LRU
> pages well. So in that case how we are going to migrate back and forth between
> system RAM and device memory ?

In my patchset there is no policy; it is all under device driver control, and the driver
decides what range of memory is migrated and when. I think only the device driver has the
proper knowledge to make such decisions, by coalescing data from GPU counters and requests
from the application made through the upper level programming API like CUDA.

Note that even on PCIe the GPU can access system memory coherently; it is the reverse that
is not doable (and there are limitations on the kind of atomic ops the device can do on
system memory). So hmm_mirror also allows that.

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
>
> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >>
> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >> >
> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
> >> callback/acceleration during migration is something we should definitely
> >> have. I still haven't figured out how non addressable and coherent device
> >> memory can fit together there. I was waiting for the page cache
> >> migration support to be pushed to the repository before I start looking
> >> at this closely.
> >>
> >
> > The page cache migration does not touch the migrate code path. My issue with
> > page cache is writeback. The only difference with existing migrate code is
> > refcount check for ZONE_DEVICE page. Everything else is the same.
>
> What about the radix tree ? does file system migrate_page callback handle
> replacing normal page with ZONE_DEVICE page/exceptional entries ?
>

It uses the exact same existing code (from mm/migrate.c), so yes, the radix tree is updated
and the buffer_heads are migrated.
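
(For reference, the hook involved is the aops ->migratepage method; a filesystem just
points it at the common helpers in mm/migrate.c, which do the radix tree slot replacement
and, in the buffer_migrate_page() case, the buffer_head transfer. A minimal sketch, where
the example_fs name is purely illustrative:)

#include <linux/fs.h>           /* address_space_operations, buffer_migrate_page() */
#include <linux/migrate.h>      /* migrate_page() for filesystems without buffer_heads */

/*
 * Sketch only: the migration entry point a filesystem exposes. The common
 * helpers in mm/migrate.c replace the page's radix tree slot and, for
 * buffer_migrate_page(), also move the attached buffer_heads.
 */
static const struct address_space_operations example_fs_aops = {
        /* other methods (readpage, writepage, ...) omitted for brevity */
        .migratepage    = buffer_migrate_page,  /* ext4/xfs use this helper */
};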

Jérôme
Re: [RFC 0/8] Define coherent device memory node
On Wed, Oct 26, 2016 at 06:26:02PM +0530, Anshuman Khandual wrote:
> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> > On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >>
> >>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> >>>> Jerome Glisse <j.glisse@gmail.com> writes:
> >>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >>>
> >>> [...]
> >>>
> >>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page
> >>>>> migration. While i put most of the migration code inside hmm_migrate.c it
> >>>>> could easily be move to migrate.c without hmm_ prefix.
> >>>>>
> >>>>> There is 2 missing piece with existing migrate code. First is to put memory
> >>>>> allocation for destination under control of who call the migrate code. Second
> >>>>> is to allow offloading the copy operation to device (ie not use the CPU to
> >>>>> copy data).
> >>>>>
> >>>>> I believe same requirement also make sense for platform you are targeting.
> >>>>> Thus same code can be use.
> >>>>>
> >>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >>>>>
> >>>>> I haven't posted this patchset yet because we are doing some modifications
> >>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE
> >>>>> changes and the overall migration code will stay the same more or less (i have
> >>>>> patches that move it to migrate.c and share more code with existing migrate
> >>>>> code).
> >>>>>
> >>>>> If you think i missed anything about lru and page cache please point it to
> >>>>> me. Because when i audited code for that i didn't see any road block with
> >>>>> the few fs i was looking at (ext4, xfs and core page cache code).
> >>>>>
> >>>>
> >>>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
> >>>> That prevents any direct allocation from coherent device by application.
> >>>> ie, we would like to force allocation from coherent device using
> >>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
> >>>
> >>> To achieve this we rely on device fault code path ie when device take a page fault
> >>> with help of HMM it will use existing memory if any for fault address but if CPU
> >>> page table is empty (and it is not file back vma because of readback) then device
> >>> can directly allocate device memory and HMM will update CPU page table to point to
> >>> newly allocated device memory.
> >>>
> >>
> >> That is ok if the device touch the page first. What if we want the
> >> allocation touched first by cpu to come from GPU ?. Should we always
> >> depend on GPU driver to migrate such pages later from system RAM to GPU
> >> memory ?
> >>
> >
> > I am not sure what kind of workload would rather have every first CPU access for
> > a range to use device memory. So no my code does not handle that and it is pointless
> > for it as CPU can not access device memory for me.
>
> If the user space application can explicitly allocate device memory directly, we
> can save one round of migration when the device start accessing it. But then one
> can argue what problem statement the device would work on on a freshly allocated
> memory which has not been accessed by CPU for loading the data yet. Will look into
> this scenario in more detail.
>
> >
> > That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall.
> > Thought my personnal preference would still be to avoid use of such generic syscall
> > but have device driver set allocation policy through its own userspace API (device
> > driver could reuse internal of mbind() to achieve the end result).
>
> Okay, the basic premise of CDM node is to have a LRU based design where we can
> avoid use of driver specific user space memory management code altogether.

And I think it is not a good fit, at least not for GPUs. GPU device drivers have a big
chunk of code dedicated to memory management. You can look at drm/ttm and at userspace
(most of it is in userspace). It is not because we want to reinvent the wheel; it is
because there are some unique constraints.


> >
> > I am not saying that eveything you want to do is doable now with HMM but, nothing
> > preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think
> > any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse
> > with device memory.
>
> With CDM node based design, the expectation is to get all/maximum core VM mechanism
> working so that, driver has to do less device specific optimization.

I think this is a bad idea, today, for GPUs, but I might be wrong.

> >
> > Each device is so different from the other that i don't believe in a one API fit all.
>
> Right, so as I had mentioned in the cover letter, pglist_data->coherent_device actually
> can become a bit mask indicating the type of coherent device the node is and that can
> be used to implement multiple types of requirement in core mm for various kinds of
> devices in the future.

I really don't want to move GPU memory management into core mm. If you only consider GPGPU
then it _might_ make sense, but for the graphics side I definitely don't think so. There
are way too many device specific considerations in respect of memory management for GPUs
(not only between different vendors but also between different generations).


> > The drm GPU subsystem of the kernel is a testimony of how little can be share when it
> > comes to GPU. The only common code is modesetting. Everything that deals with how to
> > use GPU to compute stuff is per device and most of the logic is in userspace. So i do
>
> Whats the basic reason which prevents such code/functionality sharing ?

While the higher level APIs (OpenGL, OpenCL, Vulkan, CUDA, ...) offer an abstraction model,
they are all different abstractions. There is just no way to have the kernel expose a
common API that would allow all of the above to be implemented.

Each GPU has complex memory management and requirements (which differ not only between
vendors but also between generations from the same vendor). They have a different ISA for
each generation. They have a different way to schedule jobs for each generation. They offer
different sync mechanisms. They have different page table formats, MMUs, ...

Basically each GPU generation is a platform of its own, like arm, ppc, x86, ..., so I do
not see a way to expose a common API, and I don't think anyone who has worked on any number
of GPUs sees one either. I wish there were, but it is just not the case.


> > not see any commonality that could be abstracted at syscall level. I would rather let
> > device driver stack (kernel and userspace) take such decision and have the higher level
> > API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them.
> > Programmer target those high level API and they intend to use the mechanism each offer
> > to manage memory and memory placement. I would say forcing them to use a second linux
> > specific API to achieve the latter is wrong, at lest for now.
>
> But going forward dont we want a more closely integrated coherent device solution
> which does not depend too much on a device driver stack ? and can be used from a
> basic user space program ?

That is something I want, but I strongly believe we are not there yet; we have no real
world experience. All we have in the open source community is the graphics stack (drm), and
the graphics stack clearly shows that today there is no common denominator between GPUs
outside of modesetting.

So while I share the same aim, I think for now we need to get real experience. Once we have
something like OpenCL >= 2.0, C++17 and a couple of other userspace APIs being actively
used on Linux with different coherent devices, then we can start looking at finding a
common denominator that makes sense for enough devices.

I am sure device drivers would like to get rid of their custom memory management, but I
don't think that is achievable now. I fear existing mm code would always make the worst
decision when it comes to memory placement, migration and reclaim.

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
On 10/26/2016 09:32 PM, Jerome Glisse wrote:
> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>
>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>>>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page
>>>>>>> migration. While i put most of the migration code inside hmm_migrate.c it
>>>>>>> could easily be move to migrate.c without hmm_ prefix.
>>>>>>>
>>>>>>> There is 2 missing piece with existing migrate code. First is to put memory
>>>>>>> allocation for destination under control of who call the migrate code. Second
>>>>>>> is to allow offloading the copy operation to device (ie not use the CPU to
>>>>>>> copy data).
>>>>>>>
>>>>>>> I believe same requirement also make sense for platform you are targeting.
>>>>>>> Thus same code can be use.
>>>>>>>
>>>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>>>>>>>
>>>>>>> I haven't posted this patchset yet because we are doing some modifications
>>>>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE
>>>>>>> changes and the overall migration code will stay the same more or less (i have
>>>>>>> patches that move it to migrate.c and share more code with existing migrate
>>>>>>> code).
>>>>>>>
>>>>>>> If you think i missed anything about lru and page cache please point it to
>>>>>>> me. Because when i audited code for that i didn't see any road block with
>>>>>>> the few fs i was looking at (ext4, xfs and core page cache code).
>>>>>>>
>>>>>>
>>>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>>>>>> That prevents any direct allocation from coherent device by application.
>>>>>> ie, we would like to force allocation from coherent device using
>>>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>>>>>
>>>>> To achieve this we rely on device fault code path ie when device take a page fault
>>>>> with help of HMM it will use existing memory if any for fault address but if CPU
>>>>> page table is empty (and it is not file back vma because of readback) then device
>>>>> can directly allocate device memory and HMM will update CPU page table to point to
>>>>> newly allocated device memory.
>>>>>
>>>>
>>>> That is ok if the device touch the page first. What if we want the
>>>> allocation touched first by cpu to come from GPU ?. Should we always
>>>> depend on GPU driver to migrate such pages later from system RAM to GPU
>>>> memory ?
>>>>
>>>
>>> I am not sure what kind of workload would rather have every first CPU access for
>>> a range to use device memory. So no my code does not handle that and it is pointless
>>> for it as CPU can not access device memory for me.
>>>
>>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall.
>>> Thought my personnal preference would still be to avoid use of such generic syscall
>>> but have device driver set allocation policy through its own userspace API (device
>>> driver could reuse internal of mbind() to achieve the end result).
>>>
>>> I am not saying that eveything you want to do is doable now with HMM but, nothing
>>> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think
>>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse
>>> with device memory.
>>>
>>> Each device is so different from the other that i don't believe in a one API fit all.
>>> The drm GPU subsystem of the kernel is a testimony of how little can be share when it
>>> comes to GPU. The only common code is modesetting. Everything that deals with how to
>>> use GPU to compute stuff is per device and most of the logic is in userspace. So i do
>>> not see any commonality that could be abstracted at syscall level. I would rather let
>>> device driver stack (kernel and userspace) take such decision and have the higher level
>>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them.
>>> Programmer target those high level API and they intend to use the mechanism each offer
>>> to manage memory and memory placement. I would say forcing them to use a second linux
>>> specific API to achieve the latter is wrong, at lest for now.
>>>
>>> So in the end if the mbind() syscall is done by the userspace side of the device driver
>>> then why not just having the device driver communicate this through its own kernel
>>> API (which can be much more expressive than what standardize syscall offers). I would
>>> rather avoid making change to any syscall for now.
>>>
>>> If latter, down the road, once the userspace ecosystem stabilize, we see that there
>>> is a good level at which we can abstract memory policy for enough devices then and
>>> only then it would make sense to either introduce new syscall or grow/modify existing
>>> one. Right now i fear we could only make bad decision that we would regret down the
>>> road.
>>>
>>> I think we can achieve memory device support with the minimum amount of changes to mm
>>> code and existing mm mechanism. Using ZONE_DEVICE already make sure that such memory
>>> is kept out of most mm mechanism and hence avoid all the changes you had to make for
>>> CDM node. It just looks a better fit from my point of view. I think it is worth
>>> considering for your use case too. I am sure folks writting the device driver would
>>> rather share more code between platform with grown up bus system (CAPI, CCIX, ...)
>>> vs platform with kid bus system (PCIE let's forget about PCI and ISA :))
>>
>> Because of coherent access between the CPU and the device, the intention is to use
>> the same buffer (VMA) accessed between CPU and device interchangeably through out
>> the run time of the application depending upon which side is accessing more and
>> how much of performance benefit it will provide after the migration. Now driver
>> managed memory is non LRU (whether we use ZONE_DEVICE or not) and we had issues
>> migrating non LRU pages mapped in user space. I am not sure whether Minchan had
>> changed the basic non LRU migration enablement code to support mapped non LRU
>> pages well. So in that case how we are going to migrate back and forth between
>> system RAM and device memory ?
>
> In my patchset there is no policy, it is all under device driver control which
> decide what range of memory is migrated and when. I think only device driver as
> proper knowledge to make such decision. By coalescing data from GPU counters and
> request from application made through the uppler level programming API like
> Cuda.
>

Right, I understand that. But what I pointed out here is that there are problems today in
migrating user mapped pages back and forth between LRU system RAM and non-LRU device
memory, and these are yet to be solved. Because you are proposing a non-LRU based design
with ZONE_DEVICE, how are we solving/working around these problems for bi-directional
migration?

> Note that even on PCIE the GPU can access the system memory coherently, it is the
> reverse that is not doable (and there are limitation on the kind of atomic op the
> device can do on system memory). So the hmm_mirror also allow that.

Okay.
Re: [RFC 0/8] Define coherent device memory node
On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
> On 10/26/2016 09:32 PM, Jerome Glisse wrote:
>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>>
>>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>>>>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page
>>>>>>>> migration. While i put most of the migration code inside hmm_migrate.c it
>>>>>>>> could easily be move to migrate.c without hmm_ prefix.
>>>>>>>>
>>>>>>>> There is 2 missing piece with existing migrate code. First is to put memory
>>>>>>>> allocation for destination under control of who call the migrate code. Second
>>>>>>>> is to allow offloading the copy operation to device (ie not use the CPU to
>>>>>>>> copy data).
>>>>>>>>
>>>>>>>> I believe same requirement also make sense for platform you are targeting.
>>>>>>>> Thus same code can be use.
>>>>>>>>
>>>>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>>>>>>>>
>>>>>>>> I haven't posted this patchset yet because we are doing some modifications
>>>>>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE
>>>>>>>> changes and the overall migration code will stay the same more or less (i have
>>>>>>>> patches that move it to migrate.c and share more code with existing migrate
>>>>>>>> code).
>>>>>>>>
>>>>>>>> If you think i missed anything about lru and page cache please point it to
>>>>>>>> me. Because when i audited code for that i didn't see any road block with
>>>>>>>> the few fs i was looking at (ext4, xfs and core page cache code).
>>>>>>>>
>>>>>>>
>>>>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>>>>>>> That prevents any direct allocation from coherent device by application.
>>>>>>> ie, we would like to force allocation from coherent device using
>>>>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>>>>>>
>>>>>> To achieve this we rely on device fault code path ie when device take a page fault
>>>>>> with help of HMM it will use existing memory if any for fault address but if CPU
>>>>>> page table is empty (and it is not file back vma because of readback) then device
>>>>>> can directly allocate device memory and HMM will update CPU page table to point to
>>>>>> newly allocated device memory.
>>>>>>
>>>>>
>>>>> That is ok if the device touch the page first. What if we want the
>>>>> allocation touched first by cpu to come from GPU ?. Should we always
>>>>> depend on GPU driver to migrate such pages later from system RAM to GPU
>>>>> memory ?
>>>>>
>>>>
>>>> I am not sure what kind of workload would rather have every first CPU access for
>>>> a range to use device memory. So no my code does not handle that and it is pointless
>>>> for it as CPU can not access device memory for me.
>>>>
>>>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall.
>>>> Thought my personnal preference would still be to avoid use of such generic syscall
>>>> but have device driver set allocation policy through its own userspace API (device
>>>> driver could reuse internal of mbind() to achieve the end result).
>>>>
>>>> I am not saying that eveything you want to do is doable now with HMM but, nothing
>>>> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think
>>>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse
>>>> with device memory.
>>>>
>>>> Each device is so different from the other that i don't believe in a one API fit all.
>>>> The drm GPU subsystem of the kernel is a testimony of how little can be share when it
>>>> comes to GPU. The only common code is modesetting. Everything that deals with how to
>>>> use GPU to compute stuff is per device and most of the logic is in userspace. So i do
>>>> not see any commonality that could be abstracted at syscall level. I would rather let
>>>> device driver stack (kernel and userspace) take such decision and have the higher level
>>>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them.
>>>> Programmer target those high level API and they intend to use the mechanism each offer
>>>> to manage memory and memory placement. I would say forcing them to use a second linux
>>>> specific API to achieve the latter is wrong, at lest for now.
>>>>
>>>> So in the end if the mbind() syscall is done by the userspace side of the device driver
>>>> then why not just having the device driver communicate this through its own kernel
>>>> API (which can be much more expressive than what standardize syscall offers). I would
>>>> rather avoid making change to any syscall for now.
>>>>
>>>> If latter, down the road, once the userspace ecosystem stabilize, we see that there
>>>> is a good level at which we can abstract memory policy for enough devices then and
>>>> only then it would make sense to either introduce new syscall or grow/modify existing
>>>> one. Right now i fear we could only make bad decision that we would regret down the
>>>> road.
>>>>
>>>> I think we can achieve memory device support with the minimum amount of changes to mm
>>>> code and existing mm mechanism. Using ZONE_DEVICE already make sure that such memory
>>>> is kept out of most mm mechanism and hence avoid all the changes you had to make for
>>>> CDM node. It just looks a better fit from my point of view. I think it is worth
>>>> considering for your use case too. I am sure folks writting the device driver would
>>>> rather share more code between platform with grown up bus system (CAPI, CCIX, ...)
>>>> vs platform with kid bus system (PCIE let's forget about PCI and ISA :))
>>>
>>> Because of coherent access between the CPU and the device, the intention is to use
>>> the same buffer (VMA) accessed between CPU and device interchangeably through out
>>> the run time of the application depending upon which side is accessing more and
>>> how much of performance benefit it will provide after the migration. Now driver
>>> managed memory is non LRU (whether we use ZONE_DEVICE or not) and we had issues
>>> migrating non LRU pages mapped in user space. I am not sure whether Minchan had
>>> changed the basic non LRU migration enablement code to support mapped non LRU
>>> pages well. So in that case how we are going to migrate back and forth between
>>> system RAM and device memory ?
>>
>> In my patchset there is no policy, it is all under device driver control which
>> decide what range of memory is migrated and when. I think only device driver as
>> proper knowledge to make such decision. By coalescing data from GPU counters and
>> request from application made through the uppler level programming API like
>> Cuda.
>>
>
> Right, I understand that. But what I pointed out here is that there are problems
> now migrating user mapped pages back and forth between LRU system RAM memory and
> non LRU device memory which is yet to be solved. Because you are proposing a non
> LRU based design with ZONE_DEVICE, how we are solving/working around these
> problems for bi-directional migration ?

Let me elaborate on this a bit more. Before the non-LRU migration support patch series from
Minchan, it was not possible to migrate non-LRU pages (which are generally driver managed)
through the migrate_pages interface. This was affecting the ability to do compaction on
platforms which have a large share of non-LRU pages. That series actually solved the
migration problem and allowed compaction. But it still did not solve the migration problem
for non-LRU *user mapped* pages. So if a non-LRU page is mapped into a process's page table
and is being accessed from user space, it can not be moved using the migrate_pages
interface.

Minchan had a draft solution for that problem which is still hosted here. On his suggestion
I had tried this solution but still faced some other problems during mapped page migration.
(NOTE: IIRC this was not posted to the community.)

git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following
branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53)

As I mentioned earlier, we intend to support all possible migrations between system RAM
(LRU) and device memory (non-LRU) for user space mapped pages.

(1) System RAM (Anon mapping) --> Device memory, back and forth many times
(2) System RAM (File mapping) --> Device memory, back and forth many times

This is not possible now with non-LRU pages. Here are some of the reasons, but first some
notes on the current flow (a sketch of the driver side callbacks follows the list).

* The driver initiates all the migrations
* The driver does the isolation of pages
* The driver puts the isolated pages on a linked list
* The driver passes the linked list to the migrate_pages interface for migration
* IIRC, isolation of non-LRU pages happens through the page->mapping->a_ops->isolate_page call
* If migration fails, page->mapping->a_ops->putback_page is called to give the page back to
  the device driver
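
A minimal sketch of the driver side of that flow, using the aops callbacks added by
Minchan's non-LRU movable page series (the example_dev_* names are illustrative only; a
real driver would manipulate its own page lists in these hooks):

#include <linux/fs.h>           /* address_space_operations */
#include <linux/migrate.h>      /* MIGRATEPAGE_SUCCESS */
#include <linux/mm.h>           /* isolate_mode_t */

static bool example_dev_isolate_page(struct page *page, isolate_mode_t mode)
{
        /* Take the page off the driver's internal lists so it can be moved. */
        return true;
}

static int example_dev_migratepage(struct address_space *mapping,
                                   struct page *newpage, struct page *page,
                                   enum migrate_mode mode)
{
        /* Copy contents and repoint driver metadata from page to newpage. */
        return MIGRATEPAGE_SUCCESS;
}

static void example_dev_putback_page(struct page *page)
{
        /* Migration failed: give the page back to the driver's lists. */
}

static const struct address_space_operations example_dev_aops = {
        .isolate_page   = example_dev_isolate_page,
        .migratepage    = example_dev_migratepage,
        .putback_page   = example_dev_putback_page,
};

The driver marks such pages with __SetPageMovable(page, mapping) so that the core
migration code knows to call these hooks.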

1. queue_pages_range() currently does not work with non-LRU pages and needs to be fixed

2. After a successful migration from non-LRU device memory to LRU system RAM, the non-LRU
page will be freed back. Right now migrate_pages releases these pages to the buddy
allocator, but in this situation we need the pages to be given back to the driver instead.
Hence migrate_pages needs to be changed to accommodate this.

3. After an LRU system RAM to non-LRU device migration for a mapped page, will the new
page (which came from device memory) be part of the core MM LRU for either the anon or
the file mapping?

4. After an LRU (anon mapped) system RAM to non-LRU device migration for a mapped page,
how are we going to store both the "address_space->address_space_operations" and the
"anon VMA chain" reverse mapping information on the same page->mapping element?

5. After an LRU (file mapped) system RAM to non-LRU device migration for a mapped page,
how are we going to store both the device driver's
"address_space->address_space_operations" and the radix tree based reverse mapping
information for the existing file mapping on the same page->mapping element?

6. IIRC, it was not possible to retain the non-LRU identity (page->mapping->a_ops, which is
defined inside the device driver) and the reverse mapping information (either anon or
file mapping) together after the first round of migration. This non-LRU identity needs
to be retained continuously if we ever need to return the page to the device driver
after a successful migration to system RAM, or for isolation/putback purposes, or
something else.

All the reasons explained above were preventing a continuous ping-pong scheme of migration
between system RAM LRU pages and device memory non-LRU pages, which is one of the primary
requirements for exploiting coherent device memory. Do you think we can solve these
problems with ZONE_DEVICE and the HMM framework?
Re: [RFC 0/8] Define coherent device memory node
On 27/10/16 03:28, Jerome Glisse wrote:
> On Wed, Oct 26, 2016 at 06:26:02PM +0530, Anshuman Khandual wrote:
>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>
>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>>>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> You can take a look at hmm-v13 if you want to see how i do non LRU page
>>>>>>> migration. While i put most of the migration code inside hmm_migrate.c it
>>>>>>> could easily be move to migrate.c without hmm_ prefix.
>>>>>>>
>>>>>>> There is 2 missing piece with existing migrate code. First is to put memory
>>>>>>> allocation for destination under control of who call the migrate code. Second
>>>>>>> is to allow offloading the copy operation to device (ie not use the CPU to
>>>>>>> copy data).
>>>>>>>
>>>>>>> I believe same requirement also make sense for platform you are targeting.
>>>>>>> Thus same code can be use.
>>>>>>>
>>>>>>> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>>>>>>>
>>>>>>> I haven't posted this patchset yet because we are doing some modifications
>>>>>>> to the device driver API to accomodate some new features. But the ZONE_DEVICE
>>>>>>> changes and the overall migration code will stay the same more or less (i have
>>>>>>> patches that move it to migrate.c and share more code with existing migrate
>>>>>>> code).
>>>>>>>
>>>>>>> If you think i missed anything about lru and page cache please point it to
>>>>>>> me. Because when i audited code for that i didn't see any road block with
>>>>>>> the few fs i was looking at (ext4, xfs and core page cache code).
>>>>>>>
>>>>>>
>>>>>> The other restriction around ZONE_DEVICE is, it is not a managed zone.
>>>>>> That prevents any direct allocation from coherent device by application.
>>>>>> ie, we would like to force allocation from coherent device using
>>>>>> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
>>>>>
>>>>> To achieve this we rely on device fault code path ie when device take a page fault
>>>>> with help of HMM it will use existing memory if any for fault address but if CPU
>>>>> page table is empty (and it is not file back vma because of readback) then device
>>>>> can directly allocate device memory and HMM will update CPU page table to point to
>>>>> newly allocated device memory.
>>>>>
>>>>
>>>> That is ok if the device touch the page first. What if we want the
>>>> allocation touched first by cpu to come from GPU ?. Should we always
>>>> depend on GPU driver to migrate such pages later from system RAM to GPU
>>>> memory ?
>>>>
>>>
>>> I am not sure what kind of workload would rather have every first CPU access for
>>> a range to use device memory. So no my code does not handle that and it is pointless
>>> for it as CPU can not access device memory for me.
>>
>> If the user space application can explicitly allocate device memory directly, we
>> can save one round of migration when the device start accessing it. But then one
>> can argue what problem statement the device would work on on a freshly allocated
>> memory which has not been accessed by CPU for loading the data yet. Will look into
>> this scenario in more detail.
>>
>>>
>>> That said nothing forbid to add support for ZONE_DEVICE with mbind() like syscall.
>>> Thought my personnal preference would still be to avoid use of such generic syscall
>>> but have device driver set allocation policy through its own userspace API (device
>>> driver could reuse internal of mbind() to achieve the end result).
>>
>> Okay, the basic premise of CDM node is to have a LRU based design where we can
>> avoid use of driver specific user space memory management code altogether.
>
> And i think it is not a good fit, at least not for GPU. GPU device driver have a
> big chunk of code dedicated to memory management. You can look at drm/ttm and at
> userspace (most is in userspace). It is not because we want to reinvent the wheel
> it is because they are some unique constraint.
>

Could you elaborate on the unique constraints a bit more? I looked at ttm briefly
(specifically ttm_memory.c); I can see zones being replicated, and it feels like a mini-mm
is embedded in there.

>
>>>
>>> I am not saying that eveything you want to do is doable now with HMM but, nothing
>>> preclude achieving what you want to achieve using ZONE_DEVICE. I really don't think
>>> any of the existing mm mechanism (kswapd, lru, numa, ...) are nice fit and can be reuse
>>> with device memory.
>>
>> With CDM node based design, the expectation is to get all/maximum core VM mechanism
>> working so that, driver has to do less device specific optimization.
>
> I think this is a bad idea, today, for GPU but i might be wrong.

Why do you think so? Which aspects do you think are wrong? I am guessing you
mean that the GPU driver, via the GEM/DRM/TTM layers, should interact with the
mm, manage its own memory and use some form of TTM mm abstraction? I'll
study those systems as well if possible.

>
>>>
>>> Each device is so different from the other that i don't believe in a one API fit all.
>>
>> Right, so as I had mentioned in the cover letter, pglist_data->coherent_device actually
>> can become a bit mask indicating the type of coherent device the node is and that can
>> be used to implement multiple types of requirement in core mm for various kinds of
>> devices in the future.
>
> I really don't want to move GPU memory management into core mm, if you only concider GPGPU
> then it _might_ make sense but for graphic side i definitly don't think so. There are way
> to much device specific consideration to have in respect of memory management for GPU
> (not only in between different vendor but difference between different generation).
>

Yes, GPGPU is of interest. We don't look at this as GPU memory management; the memory
on the device is coherent and is a part of the system. It comes online later and we would
like to hotplug it out if required. Since it's sitting on a bus, we do need optimizations
and the ability to migrate to and from it. I don't think it makes sense to replicate a
lot of the mm core logic to manage this memory, IMHO.

I'd also like to point out that it is wrong to assume that only a GPU can have coherent
memory, as the RFC clarifies.

>
>>> The drm GPU subsystem of the kernel is a testimony of how little can be share when it
>>> comes to GPU. The only common code is modesetting. Everything that deals with how to
>>> use GPU to compute stuff is per device and most of the logic is in userspace. So i do
>>
>> Whats the basic reason which prevents such code/functionality sharing ?
>
> While the higher level API (OpenGL, OpenCL, Vulkan, Cuda, ...) offer an abstraction model,
> they are all different abstractions. They are just no way to have kernel expose a common
> API that would allow all of the above to be implemented.
>
> Each GPU have complex memory management and requirement (not only differ between vendor
> but also between generation of same vendor). They have different isa for each generation.
> They have different way to schedule job for each generation. They offer different sync
> mechanism. They have different page table format, mmu, ...
>

Agreed

> Basicly each GPU generation is a platform on it is own, like arm, ppc, x86, ... so i do
> not see a way to expose a common API and i don't think anyone who as work on any number
> of GPU see one either. I wish but it is just not the case.
>

We are trying to leverage the ability to see coherent memory (across a set of devices
plus system RAM) to keep memory management as simple as possible.

>
>>> not see any commonality that could be abstracted at syscall level. I would rather let
>>> device driver stack (kernel and userspace) take such decision and have the higher level
>>> API (OpenCL, Cuda, C++17, ...) expose something that make sense for each of them.
>>> Programmer target those high level API and they intend to use the mechanism each offer
>>> to manage memory and memory placement. I would say forcing them to use a second linux
>>> specific API to achieve the latter is wrong, at lest for now.
>>
>> But going forward dont we want a more closely integrated coherent device solution
>> which does not depend too much on a device driver stack ? and can be used from a
>> basic user space program ?
>
> That is something i want, but i strongly believe we are not there yet, we have no real
> world experience. All we have in the open source community is the graphic stack (drm)
> and the graphic stack clearly shows that today there is no common denominator between
> GPU outside of modesetting.
>

:)

> So while i share the same aim, i think for now we need to have real experience. Once we
> have something like OpenCL >= 2.0, C++17 and couple other userspace API being actively
> use on linux with different coherent devices then we can start looking at finding a
> common denominator that make sense for enough devices.
>
> I am sure device driver would like to get rid of their custom memory management but i
> don't think this is applicable now. I fear existing mm code would always make the worst
> decision when it comes to memory placement, migration and reclaim.
>

Agreed, we don't want to make either placement/migration or reclaim slow. As I said
earlier, we should not restrict our thinking to just GPU devices.

Balbir Singh.
Re: [RFC 0/8] Define coherent device memory node
On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote:
> On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
> > On 10/26/2016 09:32 PM, Jerome Glisse wrote:
> >> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
> >>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
> >>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> >>>>> Jerome Glisse <j.glisse@gmail.com> writes:
> >>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> >>>>>>> Jerome Glisse <j.glisse@gmail.com> writes:
> >>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:

[...]

> >> In my patchset there is no policy, it is all under device driver control which
> >> decide what range of memory is migrated and when. I think only device driver as
> >> proper knowledge to make such decision. By coalescing data from GPU counters and
> >> request from application made through the uppler level programming API like
> >> Cuda.
> >>
> >
> > Right, I understand that. But what I pointed out here is that there are problems
> > now migrating user mapped pages back and forth between LRU system RAM memory and
> > non LRU device memory which is yet to be solved. Because you are proposing a non
> > LRU based design with ZONE_DEVICE, how we are solving/working around these
> > problems for bi-directional migration ?
>
> Let me elaborate on this bit more. Before non LRU migration support patch series
> from Minchan, it was not possible to migrate non LRU pages which are generally
> driver managed through migrate_pages interface. This was affecting the ability
> to do compaction on platforms which has a large share of non LRU pages. That series
> actually solved the migration problem and allowed compaction. But it still did not
> solve the migration problem for non LRU *user mapped* pages. So if the non LRU pages
> are mapped into a process's page table and being accessed from user space, it can
> not be moved using migrate_pages interface.
>
> Minchan had a draft solution for that problem which is still hosted here. On his
> suggestion I had tried this solution but still faced some other problems during
> mapped pages migration. (NOTE: IIRC this was not posted in the community)
>
> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following
> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53)
>
> As I had mentioned earlier, we intend to support all possible migrations between
> system RAM (LRU) and device memory (Non LRU) for user space mapped pages.
>
> (1) System RAM (Anon mapping) --> Device memory, back and forth many times
> (2) System RAM (File mapping) --> Device memory, back and forth many times

I achieve these 2 objectives in HMM; I sent you the additional patches for file
backed page migration. I am not done working on them but they are small.


> This is not happening now with non LRU pages. Here are some of reasons but before
> that some notes.
>
> * Driver initiates all the migrations
> * Driver does the isolation of pages
> * Driver puts the isolated pages in a linked list
> * Driver passes the linked list to migrate_pages interface for migration
> * IIRC isolation of non LRU pages happens through page->as->aops->isolate_page call
> * If migration fails, call page->as->aops->putback_page to give the page back to the
> device driver
>
> 1. queue_pages_range() currently does not work with non LRU pages, needs to be fixed
>
> 2. After a successful migration from non LRU device memory to LRU system RAM, the non
> LRU will be freed back. Right now migrate_pages releases these pages to buddy, but
> in this situation we need the pages to be given back to the driver instead. Hence
> migrate_pages needs to be changed to accommodate this.
>
> 3. After LRU system RAM to non LRU device migration for a mapped page, does the new
> page (which came from device memory) will be part of core MM LRU either for Anon
> or File mapping ?
>
> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page,
> how we are going to store "address_space->address_space_operations" and "Anon VMA
> Chain" reverse mapping information both on the page->mapping element ?
>
> 5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page,
> how we are going to store "address_space->address_space_operations" of the device
> driver and radix tree based reverse mapping information for the existing file
> mapping both on the same page->mapping element ?
>
> 6. IIRC, it was not possible to retain the non LRU identify (page->as->aops which will
> defined inside the device driver) and the reverse mapping information (either anon
> or file mapping) together after first round of migration. This non LRU identity needs
> to be retained continuously if we ever need to return this page to device driver after
> successful migration to system RAM or for isolation/putback purpose or something else.
>
> All the reasons explained above was preventing a continuous ping-pong scheme of migration
> between system RAM LRU buddy pages and device memory non LRU pages which is one of the
> primary requirements for exploiting coherent device memory. Do you think we can solve these
> problems with ZONE_DEVICE and HMM framework ?

Well, HMM already achieves migration but the design is slightly different:
* The device driver initiates migration by calling hmm_migrate(mm, start, end, pfn_array).
It must provide a pfn_array that is big enough to have one entry per page of the
range (so ((end - start) >> PAGE_SHIFT) entries). With this array there is no list of pages.

* hmm_migrate() collects source pages from the process. Right now it will only migrate
things that have been faulted, ie with a valid CPU page table entry, and will ignore
swap entries or any other special CPU page table entries. Those source pages are stored
in the pfn array (using their pfn value plus flags like write permission).

* hmm_migrate() isolates all lru pages collected in the previous step. For ZONE_DEVICE
pages it does nothing. A non lru page can be migrated only if it is a ZONE_DEVICE page;
any non lru page that is not ZONE_DEVICE is ignored.

* hmm_migrate() unmaps all the pages and checks the refcount. If a page is pinned then
it restores the CPU page table, puts the page back on the lru (if it is not a ZONE_DEVICE
page) and clears the associated entry inside the pfn_array.

* hmm_migrate() uses the device driver callback alloc_and_copy() (sketched below); this
callback will allocate destination device pages and copy from the source pages. It uses
the pfn array to know which pages in the range can be migrated (there is a flag). The
callback must also update the pfn_array and replace any entry that was successfully
allocated and copied with the pfn of the device page (and flag).

* hmm_migrate() does the final struct page meta-data migration, which might fail in the
case of a file backed page (buffer head migration fails or radix tree update fails ...).

* hmm_migrate() updates the CPU page table, ie removes the special migration entry to
point to the new page if migration was successful, or restores the old page otherwise.
It also unlocks the pages and calls put_page() on them, either through lru put back or
directly for ZONE_DEVICE pages.

* hmm_migrate() calls cleanup(); only now can the device driver update its page table.


I am slightly changing the last 2 steps: it would call the device driver callback first
and then restore the CPU page table, and the device driver callback would be renamed to
finalize_and_map().
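
Pieced together, the driver side of this would look roughly like the sketch below. The
prototypes, flags and pfn encoding are assumptions based on the description above
(mydev_* and MIG_PFN_* names are made up), not the actual hmm-v13 code:

/*
 * Sketch only: prototypes, flags and the pfn encoding are assumptions
 * pieced together from the description above; the real hmm-v13 code may
 * differ. All mydev_* and MIG_PFN_* names are made up for illustration.
 */
#include <linux/mm.h>

#define MIG_PFN_FLAG_BITS       2
#define MIG_PFN_VALID           (1UL << 0)      /* entry can be migrated   */
#define MIG_PFN_WRITE           (1UL << 1)      /* source page is writable */
#define MIG_PFN_TO_PAGE(e)      pfn_to_page((e) >> MIG_PFN_FLAG_BITS)
#define MIG_PAGE_TO_PFN(p)      (page_to_pfn(p) << MIG_PFN_FLAG_BITS)

/* Driver internals, assumed to exist elsewhere in the driver. */
struct page *mydev_alloc_device_page(void);
void mydev_copy_page(struct page *dst, struct page *src);   /* DMA engine copy */

/* alloc_and_copy() callback: allocate destination device pages and copy into them. */
static void mydev_alloc_and_copy(struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end,
                                 unsigned long *pfn_array, void *private)
{
        unsigned long i, npages = (end - start) >> PAGE_SHIFT;

        for (i = 0; i < npages; i++) {
                struct page *spage, *dpage;

                if (!(pfn_array[i] & MIG_PFN_VALID))
                        continue;               /* hmm_migrate() skipped it */
                spage = MIG_PFN_TO_PAGE(pfn_array[i]);
                dpage = mydev_alloc_device_page();
                if (!dpage) {
                        pfn_array[i] = 0;       /* give up on this entry */
                        continue;
                }
                mydev_copy_page(dpage, spage);
                /* Replace the entry with the destination pfn (plus flags). */
                pfn_array[i] = MIG_PAGE_TO_PFN(dpage) | MIG_PFN_VALID;
        }
}

/*
 * The driver kicks the whole sequence off with something like
 *      hmm_migrate(mm, start, end, pfn_array);
 * and hmm_migrate() then invokes the alloc_and_copy() callback at the
 * step described above.
 */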

So with this design:
1. is a non-issue (use of a pfn array and not a list of pages).

2. is a non-issue: a successful migration from ZONE_DEVICE (GPU memory) to system
memory calls put_page(), which in turn calls into the device driver
to inform the device driver that the page is free (assuming the refcount on the page
reaches 1).

3. The new page is not part of the LRU if it is a device page. The assumption is that the
device driver wants to manage its memory by itself and the LRU would interfere with
that. Moreover this is a device page, and thus it is not something that should be
used for emergency memory allocation or any regular allocation. So it is pointless
for the kernel to keep aging those pages to see when they can be reclaimed.

4. I do not store the address_space operations of a device; I extended struct dev_pagemap
to have more callbacks, and these can be accessed through struct page->pgmap.
So for a ZONE_DEVICE page the page->mapping points to the expected page->mapping,
ie for an anonymous page it points to the anon vma chain and for a file backed page it
points to the address space of the filesystem on which the file is.

5. See 4 above

6. I do not store any device driver specific address space operations inside struct
page. I do not see the need for that, and doing so would require major changes to
kernel mm code. All the device driver cares about is being told when a page is
free (as I am assuming the device does the allocation in the first place).

It seems you want to rely on the following struct address_space_operations callbacks:
void (*putback_page)(struct page *);
bool (*isolate_page)(struct page *, isolate_mode_t);
int (*migratepage) (...);

For putback_page I added a free_page() callback to struct dev_pagemap, which does the job.
I do not see the need for isolate_page(), and it would be bad as some filesystems do
special things in that callback. If you update the CPU page table the device should
see that, and I do not think you would need any special handling inside the device
driver code.
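
Roughly, the dev_pagemap side of the idea is something like this (a sketch of the
extension only; the exact member names and signatures in hmm-v13 may differ):

/*
 * Sketch of the dev_pagemap extension only; the exact member names and
 * signatures in hmm-v13 may differ. struct dev_pagemap is the existing
 * per-device ZONE_DEVICE structure reachable through page->pgmap.
 */
struct dev_pagemap {
        /* ... existing fields (altmap, res, ref, dev, ...) ... */

        /*
         * Called from put_page() when the refcount of a ZONE_DEVICE page
         * drops back to 1, so the driver can reuse the page. This plays
         * the role that putback_page()/"free" plays in the aops based scheme.
         */
        void (*free_page)(struct page *page, void *data);
        void *data;     /* driver private pointer passed to free_page() */
};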

For migratepage(), again I do not see the use for it. Some fs have a special callback
and that should be the one used.


So I really don't think we need to have an address_space for pages that are coming
from a device. I think we can add things to struct dev_pagemap if needed.

Did I miss something? :)

Cheers,
Jérôme
Re: [RFC 0/8] Define coherent device memory node
On 10/27/2016 08:35 PM, Jerome Glisse wrote:
> On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote:
>> On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
>>> On 10/26/2016 09:32 PM, Jerome Glisse wrote:
>>>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
>>>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>>>>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>>>>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>>>>>>>>> Jerome Glisse <j.glisse@gmail.com> writes:
>>>>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
> [...]
>
>>>> In my patchset there is no policy, it is all under device driver control which
>>>> decide what range of memory is migrated and when. I think only device driver as
>>>> proper knowledge to make such decision. By coalescing data from GPU counters and
>>>> request from application made through the uppler level programming API like
>>>> Cuda.
>>>>
>>>
>>> Right, I understand that. But what I pointed out here is that there are problems
>>> now migrating user mapped pages back and forth between LRU system RAM memory and
>>> non LRU device memory which is yet to be solved. Because you are proposing a non
>>> LRU based design with ZONE_DEVICE, how we are solving/working around these
>>> problems for bi-directional migration ?
>>
>> Let me elaborate on this bit more. Before non LRU migration support patch series
>> from Minchan, it was not possible to migrate non LRU pages which are generally
>> driver managed through migrate_pages interface. This was affecting the ability
>> to do compaction on platforms which has a large share of non LRU pages. That series
>> actually solved the migration problem and allowed compaction. But it still did not
>> solve the migration problem for non LRU *user mapped* pages. So if the non LRU pages
>> are mapped into a process's page table and being accessed from user space, it can
>> not be moved using migrate_pages interface.
>>
>> Minchan had a draft solution for that problem which is still hosted here. On his
>> suggestion I had tried this solution but still faced some other problems during
>> mapped pages migration. (NOTE: IIRC this was not posted in the community)
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following
>> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53)
>>
>> As I had mentioned earlier, we intend to support all possible migrations between
>> system RAM (LRU) and device memory (Non LRU) for user space mapped pages.
>>
>> (1) System RAM (Anon mapping) --> Device memory, back and forth many times
>> (2) System RAM (File mapping) --> Device memory, back and forth many times
>
> I achieve this 2 objective in HMM, i sent you the additional patches for file
> back page migration. I am not done working on them but they are small.

Sure, will go through them. Thanks!

>
>
>> This is not happening now with non LRU pages. Here are some of reasons but before
>> that some notes.
>>
>> * Driver initiates all the migrations
>> * Driver does the isolation of pages
>> * Driver puts the isolated pages in a linked list
>> * Driver passes the linked list to migrate_pages interface for migration
>> * IIRC isolation of non LRU pages happens through page->as->aops->isolate_page call
>> * If migration fails, call page->as->aops->putback_page to give the page back to the
>> device driver
>>
>> 1. queue_pages_range() currently does not work with non LRU pages, needs to be fixed
>>
>> 2. After a successful migration from non LRU device memory to LRU system RAM, the non
>> LRU will be freed back. Right now migrate_pages releases these pages to buddy, but
>> in this situation we need the pages to be given back to the driver instead. Hence
>> migrate_pages needs to be changed to accommodate this.
>>
>> 3. After LRU system RAM to non LRU device migration for a mapped page, does the new
>> page (which came from device memory) will be part of core MM LRU either for Anon
>> or File mapping ?
>>
>> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page,
>> how we are going to store "address_space->address_space_operations" and "Anon VMA
>> Chain" reverse mapping information both on the page->mapping element ?
>>
>> 5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page,
>> how we are going to store "address_space->address_space_operations" of the device
>> driver and radix tree based reverse mapping information for the existing file
>> mapping both on the same page->mapping element ?
>>
>> 6. IIRC, it was not possible to retain the non LRU identify (page->as->aops which will
>> defined inside the device driver) and the reverse mapping information (either anon
>> or file mapping) together after first round of migration. This non LRU identity needs
>> to be retained continuously if we ever need to return this page to device driver after
>> successful migration to system RAM or for isolation/putback purpose or something else.
>>
>> All the reasons explained above was preventing a continuous ping-pong scheme of migration
>> between system RAM LRU buddy pages and device memory non LRU pages which is one of the
>> primary requirements for exploiting coherent device memory. Do you think we can solve these
>> problems with ZONE_DEVICE and HMM framework ?
>
> Well HMM already achieve migration but design is slightly different :
> * Device driver initiate migration by calling hmm_migrate(mm, start, end, pfn_array);
> It must provide a pfn_array that is big enough to have one entry per page for the
> range (so ((end - start) >> PAGE_SHIFT) entries). With this array no list of page.

If we are not going to use the standard core migrate_pages() interface, there is no need
to build a linked list of isolated source pages for migration. Though I see a
different hmm_migrate() function in the V13 tree which involves a hmm_migrate structure,
let's focus on the hmm_migrate(mm, start, end, pfn_array) format. I guess (mm, start, end)
describes the virtual range of a process which needs to be migrated, and pfn_array[]
is the destination array of PFNs for the migration?

* I assume pfn_array[] can contain either system RAM PFNs or device memory PFNs? Will
it support migration in both directions?

* Can a device memory PFN have struct pages (if ZONE_DEVICE based), or might it have no
struct pages at all?
>
> * hmm_migrate() collect source pages from the process. Right now it will only migrate
> thing that have been faulted ie with a valid CPU page table entry and will ignore
> swap entry, or any other special CPU page table entry. Those source pages are store
> in the pfn array (using their pfn value with flag like write permission)

So source PFNs go into pfn_array[]; I was thinking it contained destination PFNs.

>
> * hmm_migrate() isolate all lru pages collected in previous step. For ZONE_DEVICE pages
> it does nothing. Non lru page can be migrated only if it is a ZONE_DEVICE page. Any
> non lru page that is not ZONE_DEVICE is ignored.

Hmm, maybe because it has neither page->pgmap (which you have extended to contain
some driver specific callbacks) nor page->as->aops (Minchan Kim's framework).
Therefore any other kind of non LRU page cannot migrate.

>
> * hmm_migrate() unmap all the pages and check the refcount. If there a page is pin then
> it restore CPU page table, put back the page on lru (if it is not a ZONE_DEVICE page)
> and clear the associated entry inside the pfn_array.

Got it. pfn_array[] at the end will contain all PFNs which need to be migrated.

>
> * hmm_migrate() use device driver callback alloc_and_copy() this device driver callback
> will allocate destination device page and copy from the source page. It uses the pfn

So if the migration is from device to system RAM, alloc_and_copy() will allocate the
destination system RAM pages, and at that point pfn_array[] contains the source device
memory PFNs? I am just trying to see if it works both ways.

> array to know which page can be migrated in the range (there is a flag). The callback
> must also update the pfn_array and replace any entry that was successfully allocated
> and copied with the pfn of the device page (and flag).
>
> * hmm_migrate() do the final struct page meta-data migration which might fail in case of
> file back page (buffer head migration fails or radix tree fails ...)
>
> * hmm_migrate() update the CPU page table ie remove migration special entry to point
> to new page if migration successfull or restore to old page otherwise. It also unlock
> page and call put_page() on them either through lru put back or directly for
> ZONE_DEVICE pages.

If it's a ZONE_DEVICE page, does the registered device driver also get notified about it,
so that it can update its own accounting of the allocated and free memory pages
that it owns through a hot plugged ZONE_DEVICE zone?

>
> * hmm_migrate() call cleanup() only now device driver can update its page table

Though I still need to understand the page table mirroring part, I can clearly see
that hmm_migrate() attempts to implement a parallel migrate_pages() kind of interface
which can work with non LRU pages (right now ZONE_DEVICE based only) and a device
driver. We will have to see whether this hmm_migrate() interface can accommodate all
kinds and directions of migration.

Minchan Kim's framework enabled non LRU page migration in a different way. The device
driver is supposed to create a standalone struct address_space_operations and struct
address_space and load them into each struct page with a call (see the sketch below).
Now all non LRU pages contain the standalone struct address_space_operations as
page->as->aops based callbacks.
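
From the driver side that registration looks roughly like this (IIRC; mydev_* names are
made up and the exact registration calls should be double checked against his tree):

/*
 * Rough sketch of the driver side of Minchan's movable non LRU scheme.
 * mydev_* names are made up; mydev_aops is an address_space_operations
 * providing .isolate_page/.putback_page/.migratepage as sketched earlier
 * in this thread. Please double check the exact calls against his tree.
 */
#include <linux/compaction.h>
#include <linux/fs.h>
#include <linux/mm.h>

extern const struct address_space_operations mydev_aops;

static struct address_space mydev_mapping;      /* stand alone, no inode behind it */

static void mydev_export_page(struct page *page)
{
        mydev_mapping.a_ops = &mydev_aops;
        /* Marks the page movable and stores the mapping in page->mapping. */
        __SetPageMovable(page, &mydev_mapping);
}

static void mydev_unexport_page(struct page *page)
{
        __ClearPageMovable(page);
}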

Now that we have a different way of enabling non LRU device page migration by extending
the ZONE_DEVICE framework, does it overlap with the functionality already supported
by the previous framework? I am just curious.

>
>
> I slightly changing the last 2 step, it would be call device driver callback first
> and then restore CPU page table and device driver callback would be rename to
> finalize_and_map().
>
> So with this design:
> 1. is a non-issue (use of pfn array and not list of page).

Right.

>
> 2. is a non-issue successfull migration from ZONE_DEVICE (GPU memory) to system
> memory call put_page() which in turn will call inside the device driver
> to inform the device driver that page is free (assuming refcount on page
> reach 1)

Right.

>
> 3. New page is not part of the LRU if it is a device page. Assumption is that the
> device driver wants to manage its memory by itself and LRU would interfer with
> that. Moreover this is a device page and thus it is not something that should be
> use for emergency memory allocation or any regular allocation. So it is pointless
> for kernel to try to keep aging those pages to see when they can be reclaim.

If the driver manages everything, these device memory pages need not be on the LRU after
migration. But not being on any LRU makes it difficult for other core MM features to work
on these pages any more. Almost all core mm interfaces expect the pages to be on an LRU,
IIUC. Though they could all be changed to accommodate non LRU pages, don't you think that
would be a lot of work? Just curious.

>
> 4. I do not store address_space operation of a device, i extended struct dev_pagemap
> to have more callback and this can be access through struct page->pgmap
> So the for ZONE_DEVICE page the page->mapping point to the expected page->mapping
> ie for anonymous page it points to the anon vma chain and for file back page it
> points to the address space of the filesystem on which the file is.

Right.

>
> 5. See 4 above

Right.

>
> 6. I do not store any device driver specific address space operation inside struct
> page. I do not see the need for that and doing so would require major changes to
> kernel mm code. All the device driver cares about is being told when a page is
> free (as i am assuming device does the allocation in the first place).
>

Minchan's work introduced the idea of PageMovable (IIUC, it just says it's a movable
non LRU page with page->mapping->aops and some struct page flags) and changed parts
of the core MM migration and compaction functions to accommodate movable pages.

> It seems you want to rely on following struct address_space_operations callback:
> void (*putback_page)(struct page *);
> bool (*isolate_page)(struct page *, isolate_mode_t);
> int (*migratepage) (...);
>
> For putback_page i added a free_page() to struct dev_pagemap which does the job.

Right, that sounds correct for this ZONE_DEVICE based framework.

> I do not see need for isolate_page() and it would be bad as some filesystem do
> special thing in that callback. If you update the CPU page table the device should

It was a dummy device driver specific address_space_operations, hence it's not related
to any file system as such.

> see that and i do not think you would need any special handling inside the device
> driver code.

I need to understand this part. How does a callback from a CPU page table update reach
the device driver? I will go through HMM V13 for that.
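
My guess is that this goes through mmu notifiers, which HMM builds its mirroring on; from
the driver side it would look roughly like the sketch below (mydev_* names are made up;
how HMM wraps these hooks is exactly what I need to read up on):

/*
 * Sketch only -- the generic way a driver learns about CPU page table
 * updates is an mmu notifier; HMM builds its mirroring on top of this.
 * mydev_* names are made up for illustration.
 */
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static void mydev_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end)
{
        /* Shoot down the device page table entries covering [start, end). */
}

static const struct mmu_notifier_ops mydev_mmu_ops = {
        .invalidate_range_start = mydev_invalidate_range_start,
};

static struct mmu_notifier mydev_mn = {
        .ops = &mydev_mmu_ops,
};

/* After this, CPU page table changes in mm are reported to the callback above. */
static int mydev_mirror_mm(struct mm_struct *mm)
{
        return mmu_notifier_register(&mydev_mn, mm);
}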

>
> For migratepage() again i do not see the use for it. Some fs have special callback
> and that should be the one use.
>
>
> So i really don't think we need to have an address_space for page that are coming
> from device. I think we can add thing to struct dev_pagemap if needed.

Right, sounds correct from this ZONE_DEVICE based framework.

>
> Did i miss something ? :)

Will have more questions after looking deeper into HMM :)
Re: [RFC 0/8] Define coherent device memory node
Jerome Glisse <j.glisse@gmail.com> writes:

> On Wed, Oct 26, 2016 at 04:39:19PM +0530, Aneesh Kumar K.V wrote:
>> Jerome Glisse <j.glisse@gmail.com> writes:
>>
>> > On Tue, Oct 25, 2016 at 09:56:35AM +0530, Aneesh Kumar K.V wrote:
>> >> Jerome Glisse <j.glisse@gmail.com> writes:
>> >>
>> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>> >> >
>> >> I looked at the hmm-v13 w.r.t migration and I guess some form of device
>> >> callback/acceleration during migration is something we should definitely
>> >> have. I still haven't figured out how non addressable and coherent device
>> >> memory can fit together there. I was waiting for the page cache
>> >> migration support to be pushed to the repository before I start looking
>> >> at this closely.
>> >>
>> >
>> > The page cache migration does not touch the migrate code path. My issue with
>> > page cache is writeback. The only difference with existing migrate code is
>> > refcount check for ZONE_DEVICE page. Everything else is the same.
>>
>> What about the radix tree ? does file system migrate_page callback handle
>> replacing normal page with ZONE_DEVICE page/exceptional entries ?
>>
>
> It use the exact same existing code (from mm/migrate.c) so yes the radix tree
> is updated and buffer_head are migrated.
>

I looked at the page cache migration patches shared, and I find that
you are not using exceptional entries when we migrate a page cache page to
device memory. But I am now not sure how a read from the page cache will
work with that.

ie, a file system read will now find the page in the page cache, but we
cannot do a copy_to_user of that page because it is now backed by
unaddressable memory, right?

do_generic_file_read() does
page = find_get_page(mapping, index);
....
ret = copy_page_to_iter(page, offset, nr, iter);

which does
void *kaddr = kmap_atomic(page);
size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
kunmap_atomic(kaddr);


-aneesh
