
[PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

Hi all,
This patchset introduces cbd (CXL block device). It's based on Linux 6.8 and is available at:
https://github.com/DataTravelGuide/linux

(1) What is cbd:
As shared memory is supported in the CXL 3.0 spec, we can transfer data
via CXL shared memory. CBD means CXL block device; it uses CXL shared memory
to transfer commands and data, providing access to a block device on a different
host, as shown below:

+----------------------+            +--------------------------------------+
|        node-1        |            |                node-2                |
|                      |            |                                      |
|   +--------+         |            |   +----------+       +----------+   |
|   |  cbd0  |         |            |   | backend0 |------>| /dev/sda |   |
|   +--------+         |            |   +----------+       +----------+   |
|   |  pmem0 |         |            |   |  pmem0   |                      |
|   +------------+     |            |   +------------+                    |
|   | cxl driver |     |            |   | cxl driver |                    |
+---------+------------+            +---------+----------------------------+
          |                                   |
          | CXL                           CXL |
          |                                   |
          +---------------+      +------------+
                          |      |
            +-------------+------+-----------------+
            | shared memory device (cbd transport) |
            +--------------------------------------+

Any read/write to cbd0 on node-1 will be transferred to /dev/sda on node-2. It works similarly to
nbd (network block device), but it transfers data via CXL shared memory rather than over the network.

(2) Layout of transport:

+------------------------------------------------------------------------------+
|                                cbd transport                                 |
+--------------------+------------+------------+------------+------------------+
|                    |   hosts    |  backends  |  blkdevs   |     channels     |
| cbd transport info +------------+------------+------------+------------------+
|                    | [] [] ...  | [] [] ...  | [] [] ...  | [] [] [] ...     |
+--------------------+------------+------------+------------+---------+--------+
                                                                      |
                                                                      |
                                                                      |
              +-------------------------------------------------------+---+
              |                          channel                          |
              +------------------------------+----------------------------+
              |         channel meta         |        channel data        |
              +-------------+----------------+----------------------------+
                            |
                            |
                            |
              +-------------+----------------------------------------+
              |                     channel meta                     |
              +------------+------------------+----------------------+
              | meta ctrl  |    comp ring     |       cmd ring       |
              +------------+------------------+----------------------+

The shared memory is divided into five regions:

a) Transport_info:
Information about the overall transport, including the layout of the transport.
b) Hosts:
Each host wishing to utilize this transport needs to register its own information within a host entry in this region.
c) Backends:
Starting a backend on a host requires filling in information in a backend entry within this region.
d) Blkdevs:
Once a backend is established, it can be mapped to any associated host. The information about the blkdevs is then filled into the blkdevs region.
e) Channels:
This is the actual data communication area, where communication between blkdev and backend occurs. Each queue of a block device uses a channel, and each backend has a corresponding handler interacting with this queue.
f) Channel:
Each channel is further divided into a meta region and a data region.
The meta region includes the cmd ring and the comp ring. The blkdev converts upper-layer requests into cbd_se entries and fills them into the cmd ring.
The handler takes each cbd_se from the cmd ring and submits it to the backend's local block device (e.g., /dev/sda).
After completion, the results are formed into cbd_ce and filled into the comp ring.
The blkdev then receives the cbd_ce and returns the results to the upper-layer IO sender.

Currently, the number of entries in each region and the channel size are both set to default values. In the future, they will be made configurable.
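
For illustration only, here is a rough sketch of what a channel's ring entries and "meta ctrl" indices could look like. The names and fields below are hypothetical and simplified; the real definitions live in drivers/block/cbd/cbd_internal.h and differ in detail:

#include <stdint.h>

/* hypothetical submit entry, filled by the blkdev side */
struct cbd_se {
        uint64_t priv_data;     /* tag used to match the completion */
        uint32_t op;            /* READ / WRITE / FLUSH ... */
        uint32_t len;           /* length of the I/O */
        uint64_t offset;        /* offset within the backend block device */
        uint32_t data_off;      /* where the data sits in the channel data area */
};

/* hypothetical completion entry, filled by the backend handler */
struct cbd_ce {
        uint64_t priv_data;     /* copied back from the matching cbd_se */
        int32_t  result;        /* 0 on success, negative errno on failure */
};

/* hypothetical "meta ctrl": ring indices shared by both sides */
struct cbd_channel_ctrl {
        uint32_t cmd_head;      /* advanced by blkdev after queuing a cbd_se */
        uint32_t cmd_tail;      /* advanced by handler after consuming one */
        uint32_t compr_head;    /* advanced by handler after queuing a cbd_ce */
        uint32_t compr_tail;    /* advanced by blkdev after consuming one */
};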

(3) Naming of CBD:
Actually, cbd does not strictly depend on CXL; any shared memory can be used for cbd. But
I could not come up with a better name, maybe smxbd (shared memory transport block device)? I chose
CBD as it sounds more concise and elegant. Any suggestions?

(4) dax is not supported yet:
As with famfs, dax devices are not supported here, because the dax device does not support
dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.

(5) How do blkdev and backend interact through the channel?
a) For the reader side, before reading the data, if the data in this channel may be modified by the other party, then I need to flush the cache before reading to ensure that I get the latest data. For example, the blkdev needs to flush the cache before obtaining compr_head because compr_head will be updated by the backend handler.
b) For the writer side, if the written information will be read by others, then after writing, I need to flush the cache to let the other party see it immediately. For example, after blkdev submits cbd_se, it needs to update cmd_head to let the handler have a new cbd_se. Therefore, after updating cmd_head, I need to flush the cache to let the backend see it.
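
As a concrete but deliberately simplified userspace-style sketch of this pattern (assuming x86 and clflush-based cache maintenance; this is not the driver code, just an illustration of the ordering described above):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

/* writer side: publish a new cbd_se, then advance cmd_head */
static void publish_se(void *se, size_t se_size,
                       volatile uint32_t *cmd_head, uint32_t new_head)
{
        /* flush the freshly written cbd_se out of the local cache */
        for (size_t off = 0; off < se_size; off += 64)
                _mm_clflush((char *)se + off);
        _mm_sfence();

        /* advance cmd_head so the handler sees a new cbd_se, then flush it */
        *cmd_head = new_head;
        _mm_clflush((const void *)cmd_head);
        _mm_sfence();
}

/* reader side: invalidate the stale local copy before reading compr_head */
static uint32_t read_compr_head(volatile uint32_t *compr_head)
{
        _mm_clflush((const void *)compr_head);
        _mm_mfence();
        return *compr_head;
}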

(6) race between management operations:
There may be a race condition, for example: if we use backend-start on different nodes at the same time,
it's possible to allocate the same backend ID. This issue should be handled by the upper-layer
manager, ensuring that all management operations are serialized, such as acquiring a distributed lock.
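
For example (purely illustrative), if all management commands happen to be issued from a single controller node, a local lock such as flock(1) is already enough to serialize them; with multiple controller nodes, a real distributed lock service would be needed instead:

# hypothetical wrapper on a single management node
flock /run/cbd-admin.lock -c \
    'echo "op=backend-start,path=/dev/ram0p1" > /sys/bus/cbd/devices/transport0/adm'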

(7) What's Next?
This is a first version of CBD, and there is still much work to be done, such as: how to recover a backend service when a backend node fails? How to gracefully stop the associated blkdev when a backend service cannot be recovered? How to clear dead information within the transport layer? For a non-volatile memory transport, allocating a new area as a write-ahead log (WAL) may also be considered.

(8) testing with qemu:
We can use two QEMU virtual machines to test CBD by sharing a CXLMemDev:

a) Start two QEMU virtual machines, sharing a CXLMemDev.

root@qemu-2:~# cxl list
[
  {
    "memdev":"mem0",
    "pmem_size":536870912,
    "serial":0,
    "host":"0000:0d:00.0"
  }
]

root@qemu-1:~# cxl list
[
  {
    "memdev":"mem0",
    "pmem_size":536870912,
    "serial":0,
    "host":"0000:0d:00.0"
  }
]

b) Register a CBD transport on node-1 and add a backend, specifying the path as /dev/ram0p1.
root@qemu-1:~# cxl create-region -m mem0 -d decoder0.0 -t pmem
{
  "region":"region0",
  "resource":"0x1890000000",
  "size":"512.00 MiB (536.87 MB)",
  "type":"pmem",
  "interleave_ways":1,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region
root@qemu-1:~# ndctl create-namespace -r region0 -m fsdax --map dev -t pmem

{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"502.00 MiB (526.39 MB)",
  "uuid":"618e9627-4345-4046-ba46-becf430a1464",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}
root@qemu-1:~# echo "path=/dev/pmem0,hostname=node-1,force=1,format=1" > /sys/bus/cbd/transport_register
root@qemu-1:~# echo "op=backend-start,path=/dev/ram0p1" > /sys/bus/cbd/devices/transport0/adm

c) Register a CBD transport on node-2 and add a blkdev, specifying the backend ID as the backend on node-1.
root@qemu-2:~# cxl create-region -m mem0 -d decoder0.0 -t pmem
{
  "region":"region0",
  "resource":"0x390000000",
  "size":"512.00 MiB (536.87 MB)",
  "type":"pmem",
  "interleave_ways":1,
  "interleave_granularity":256,
  "decode_state":"commit",
  "mappings":[
    {
      "position":0,
      "memdev":"mem0",
      "decoder":"decoder2.0"
    }
  ]
}
cxl region: cmd_create_region: created 1 region
root@qemu-2:~# ndctl create-namespace -r region0 -m fsdax --map dev -t pmem -b 0
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"502.00 MiB (526.39 MB)",
  "uuid":"a7fae1a5-2cba-46d7-83a2-20a76d736848",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}
root@qemu-2:~# echo "path=/dev/pmem0,hostname=node-2" > /sys/bus/cbd/transport_register
root@qemu-2:~# echo "op=dev-start,backend_id=0,queues=1" > /sys/bus/cbd/devices/transport0/adm

d) On node-2, you will get a /dev/cbd0, and all reads and writes to cbd0 will actually read from and write to /dev/ram0p1 on node-1.
root@qemu-2:~# mkfs.xfs -f /dev/cbd0
meta-data=/dev/cbd0              isize=512    agcount=4, agsize=655360 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=2621440, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
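
As an additional smoke test (ordinary commands, nothing cbd-specific), one could mount the filesystem created above and push some I/O through it, for example:

mount /dev/cbd0 /mnt
dd if=/dev/urandom of=/mnt/test.bin bs=1M count=64 oflag=direct
md5sum /mnt/test.bin
umount /mnt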


Thanx

Dongsheng Yang (7):
block: Init for CBD(CXL Block Device)
cbd: introduce cbd_transport
cbd: introduce cbd_channel
cbd: introduce cbd_host
cbd: introduce cbd_backend
cbd: introduce cbd_blkdev
cbd: add related sysfs files in transport register

drivers/block/Kconfig | 2 +
drivers/block/Makefile | 2 +
drivers/block/cbd/Kconfig | 4 +
drivers/block/cbd/Makefile | 3 +
drivers/block/cbd/cbd_backend.c | 254 +++++++++
drivers/block/cbd/cbd_blkdev.c | 375 +++++++++++++
drivers/block/cbd/cbd_channel.c | 179 +++++++
drivers/block/cbd/cbd_handler.c | 261 +++++++++
drivers/block/cbd/cbd_host.c | 123 +++++
drivers/block/cbd/cbd_internal.h | 830 +++++++++++++++++++++++++++++
drivers/block/cbd/cbd_main.c | 230 ++++++++
drivers/block/cbd/cbd_queue.c | 621 ++++++++++++++++++++++
drivers/block/cbd/cbd_transport.c | 845 ++++++++++++++++++++++++++++++
13 files changed, 3729 insertions(+)
create mode 100644 drivers/block/cbd/Kconfig
create mode 100644 drivers/block/cbd/Makefile
create mode 100644 drivers/block/cbd/cbd_backend.c
create mode 100644 drivers/block/cbd/cbd_blkdev.c
create mode 100644 drivers/block/cbd/cbd_channel.c
create mode 100644 drivers/block/cbd/cbd_handler.c
create mode 100644 drivers/block/cbd/cbd_host.c
create mode 100644 drivers/block/cbd/cbd_internal.h
create mode 100644 drivers/block/cbd/cbd_main.c
create mode 100644 drivers/block/cbd/cbd_queue.c
create mode 100644 drivers/block/cbd/cbd_transport.c

--
2.34.1
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
Dongsheng Yang wrote:
> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
>
> Hi all,
> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> https://github.com/DataTravelGuide/linux
>
[..]
> (4) dax is not supported yet:
> same with famfs, dax device is not supported here, because dax device does not support
> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.

I am glad that famfs is mentioned here, it demonstrates you know about
it. However, unfortunately this cover letter does not offer any analysis
of *why* the Linux project should consider this additional approach to
the inter-host shared-memory enabling problem.

To be clear I am neutral at best on some of the initiatives around CXL
memory sharing vs pooling, but famfs at least jettisons block-devices
and gets closer to a purpose-built memory semantic.

So my primary question is why would Linux need both famfs and cbd? I am
sure famfs would love feedback and help vs developing competing efforts.
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
On 2024/4/24 12:29, Dan Williams wrote:
> Dongsheng Yang wrote:
>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
>>
>> Hi all,
>> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
>> https://github.com/DataTravelGuide/linux
>>
> [..]
>> (4) dax is not supported yet:
>> same with famfs, dax device is not supported here, because dax device does not support
>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
>
> I am glad that famfs is mentioned here, it demonstrates you know about
> it. However, unfortunately this cover letter does not offer any analysis
> of *why* the Linux project should consider this additional approach to
> the inter-host shared-memory enabling problem.
>
> To be clear I am neutral at best on some of the initiatives around CXL
> memory sharing vs pooling, but famfs at least jettisons block-devices
> and gets closer to a purpose-built memory semantic.
>
> So my primary question is why would Linux need both famfs and cbd? I am
> sure famfs would love feedback and help vs developing competing efforts.

Hi,
Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
shared memory, and related nodes can share the data inside this file
system; whereas cbd does not store data in shared memory, it uses shared
memory as a channel for data transmission, and the actual data is stored
in the backend block device of remote nodes. In cbd, shared memory works
more like network to connect different hosts.

That is to say, in my view, FAMfs and cbd do not conflict at all; they
meet different scenario requirements. cbd simply uses shared memory to
transmit data, shared memory plays the role of a data transmission
channel, while in FAMfs, shared memory serves as a data store role.

Please correct me if I am wrong.

Thanx
> .
>
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:
>
>
> On 2024/4/24 12:29, Dan Williams wrote:
> > Dongsheng Yang wrote:
> > > From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> > >
> > > Hi all,
> > > This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> > > https://github.com/DataTravelGuide/linux
> > >
> > [..]
> > > (4) dax is not supported yet:
> > > same with famfs, dax device is not supported here, because dax device does not support
> > > dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
> >
> > I am glad that famfs is mentioned here, it demonstrates you know about
> > it. However, unfortunately this cover letter does not offer any analysis
> > of *why* the Linux project should consider this additional approach to
> > the inter-host shared-memory enabling problem.
> >
> > To be clear I am neutral at best on some of the initiatives around CXL
> > memory sharing vs pooling, but famfs at least jettisons block-devices
> > and gets closer to a purpose-built memory semantic.
> >
> > So my primary question is why would Linux need both famfs and cbd? I am
> > sure famfs would love feedback and help vs developing competing efforts.
>
> Hi,
> Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
> shared memory, and related nodes can share the data inside this file system;
> whereas cbd does not store data in shared memory, it uses shared memory as a
> channel for data transmission, and the actual data is stored in the backend
> block device of remote nodes. In cbd, shared memory works more like network
> to connect different hosts.
>

Couldn't you basically just allocate a file for use as a uni-directional
buffer on top of FAMFS and achieve the same thing without the need for
additional kernel support? Similar in a sense to allocating a file on
network storage and pinging the remote host when it's ready (except now
it's fast!)

(The point here is not "FAMFS is better" or "CBD is better", simply
trying to identify the function that will ultimately dictate the form).

~Gregory
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
Dongsheng Yang wrote:
>
>
> ? 2024/4/24 ??? ?? 12:29, Dan Williams ??:
> > Dongsheng Yang wrote:
> >> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> >>
> >> Hi all,
> >> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> >> https://github.com/DataTravelGuide/linux
> >>
> > [..]
> >> (4) dax is not supported yet:
> >> same with famfs, dax device is not supported here, because dax device does not support
> >> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
> >
> > I am glad that famfs is mentioned here, it demonstrates you know about
> > it. However, unfortunately this cover letter does not offer any analysis
> > of *why* the Linux project should consider this additional approach to
> > the inter-host shared-memory enabling problem.
> >
> > To be clear I am neutral at best on some of the initiatives around CXL
> > memory sharing vs pooling, but famfs at least jettisons block-devices
> > and gets closer to a purpose-built memory semantic.
> >
> > So my primary question is why would Linux need both famfs and cbd? I am
> > sure famfs would love feedback and help vs developing competing efforts.
>
> Hi,
> Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
> shared memory, and related nodes can share the data inside this file
> system; whereas cbd does not store data in shared memory, it uses shared
> memory as a channel for data transmission, and the actual data is stored
> in the backend block device of remote nodes. In cbd, shared memory works
> more like network to connect different hosts.
>
> That is to say, in my view, FAMfs and cbd do not conflict at all; they
> meet different scenario requirements. cbd simply uses shared memory to
> transmit data, shared memory plays the role of a data transmission
> channel, while in FAMfs, shared memory serves as a data store role.

If shared memory is just a communication transport then a block-device
abstraction does not seem a proper fit. From the above description this
sounds similar to what CONFIG_NTB_TRANSPORT offers which is a way for
two hosts to communicate over a shared memory channel.

So, I am not really looking for an analysis of famfs vs CBD I am looking
for CBD to clarify why Linux should consider it, and why the
architecture is fit for purpose.
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
On 2024/4/24 11:14, Gregory Price wrote:
> On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:
>>
>>
>> On 2024/4/24 12:29, Dan Williams wrote:
>>> Dongsheng Yang wrote:
>>>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
>>>>
>>>> Hi all,
>>>> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
>>>> https://github.com/DataTravelGuide/linux
>>>>
>>> [..]
>>>> (4) dax is not supported yet:
>>>> same with famfs, dax device is not supported here, because dax device does not support
>>>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
>>>
>>> I am glad that famfs is mentioned here, it demonstrates you know about
>>> it. However, unfortunately this cover letter does not offer any analysis
>>> of *why* the Linux project should consider this additional approach to
>>> the inter-host shared-memory enabling problem.
>>>
>>> To be clear I am neutral at best on some of the initiatives around CXL
>>> memory sharing vs pooling, but famfs at least jettisons block-devices
>>> and gets closer to a purpose-built memory semantic.
>>>
>>> So my primary question is why would Linux need both famfs and cbd? I am
>>> sure famfs would love feedback and help vs developing competing efforts.
>>
>> Hi,
>> Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
>> shared memory, and related nodes can share the data inside this file system;
>> whereas cbd does not store data in shared memory, it uses shared memory as a
>> channel for data transmission, and the actual data is stored in the backend
>> block device of remote nodes. In cbd, shared memory works more like network
>> to connect different hosts.
>>
>
> Couldn't you basically just allocate a file for use as a uni-directional
> buffer on top of FAMFS and achieve the same thing without the need for
> additional kernel support? Similar in a sense to allocating a file on
> network storage and pinging the remote host when it's ready (except now
> it's fast!)

I'm not entirely sure I follow your suggestion. I guess it means that
cbd would no longer directly manage the pmem device, but allocate files
on famfs to transfer data. I didn't do it this way because I considered
at least a few points: one of them is, cbd_transport actually requires a
DAX device to access shared memory, and cbd has very simple requirements
for space management, so there's no need to rely on a file system layer,
which would increase architectural complexity.

However, we still need cbd_blkdev to provide a block device, so it
doesn't achieve "achieve the same without the need for additional kernel
support".

Could you please provide more specific details about your suggestion?
>
> (The point here is not "FAMFS is better" or "CBD is better", simply
> trying to identify the function that will ultimately dictate the form).

Thank you for your clarification. I totally agree with it; discussions
always make the issues clearer.

Thanx
>
> ~Gregory
>
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
On Fri, Apr 26, 2024 at 09:25:53AM +0800, Dongsheng Yang wrote:
>
>
> On 2024/4/24 11:14, Gregory Price wrote:
> > On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:
> > >
> > >
> > > On 2024/4/24 12:29, Dan Williams wrote:
> > > > Dongsheng Yang wrote:
> > > > > From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> > > > >
> > > > > Hi all,
> > > > > This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> > > > > https://github.com/DataTravelGuide/linux
> > > > >
> > > > [..]
> > > > > (4) dax is not supported yet:
> > > > > same with famfs, dax device is not supported here, because dax device does not support
> > > > > dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
> > > >
> > > > I am glad that famfs is mentioned here, it demonstrates you know about
> > > > it. However, unfortunately this cover letter does not offer any analysis
> > > > of *why* the Linux project should consider this additional approach to
> > > > the inter-host shared-memory enabling problem.
> > > >
> > > > To be clear I am neutral at best on some of the initiatives around CXL
> > > > memory sharing vs pooling, but famfs at least jettisons block-devices
> > > > and gets closer to a purpose-built memory semantic.
> > > >
> > > > So my primary question is why would Linux need both famfs and cbd? I am
> > > > sure famfs would love feedback and help vs developing competing efforts.
> > >
> > > Hi,
> > > Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
> > > shared memory, and related nodes can share the data inside this file system;
> > > whereas cbd does not store data in shared memory, it uses shared memory as a
> > > channel for data transmission, and the actual data is stored in the backend
> > > block device of remote nodes. In cbd, shared memory works more like network
> > > to connect different hosts.
> > >
> >
> > Couldn't you basically just allocate a file for use as a uni-directional
> > buffer on top of FAMFS and achieve the same thing without the need for
> > additional kernel support? Similar in a sense to allocating a file on
> > network storage and pinging the remote host when it's ready (except now
> > it's fast!)
>
> I'm not entirely sure I follow your suggestion. I guess it means that cbd
> would no longer directly manage the pmem device, but allocate files on famfs
> to transfer data. I didn't do it this way because I considered at least a
> few points: one of them is, cbd_transport actually requires a DAX device to
> access shared memory, and cbd has very simple requirements for space
> management, so there's no need to rely on a file system layer, which would
> increase architectural complexity.
>
> However, we still need cbd_blkdev to provide a block device, so it doesn't
> achieve "achieve the same without the need for additional kernel support".
>
> Could you please provide more specific details about your suggestion?

Fundamentally you're shuffling bits from one place to another, the
ultimate target is storage located on another device as opposed to
the memory itself. So you're using CXL as a transport medium.

Could you not do the same thing with a file in FAMFS, and put all of
the transport logic in userland? Then you'd just have what looks like
a kernel bypass transport mechanism built on top of a file backed by
shared memory.

Basically it's unclear to me why this must be done in the kernel.
Performance? Explicit bypass? Some technical reason I'm missing?


Also, on a tangential note, you're using pmem/qemu to emulate the
behavior of shared CXL memory. You should probably explain the
coherence implications of the system more explicitly.

The emulated system implements what amounts to hardware-coherent
memory (i.e. the two QEMU machines run on the same physical machine,
so coherency is managed within the same coherence domain).

If there is no explicit coherence control in software, then it is
important to state that this system relies on hardware that implements
snoop back-invalidate (which is not a requirement of a CXL 3.x device,
just a feature described by the spec that may be implemented).

~Gregory
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
On 2024/4/26 9:48, Gregory Price wrote:
> On Fri, Apr 26, 2024 at 09:25:53AM +0800, Dongsheng Yang wrote:
>>
>>
>> On 2024/4/24 11:14, Gregory Price wrote:
>>> On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:
>>>>
>>>>
>>>> On 2024/4/24 12:29, Dan Williams wrote:
>>>>> Dongsheng Yang wrote:
>>>>>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
>>>>>>
>>>>>> Hi all,
>>>>>> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
>>>>>> https://github.com/DataTravelGuide/linux
>>>>>>
>>>>> [..]
>>>>>> (4) dax is not supported yet:
>>>>>> same with famfs, dax device is not supported here, because dax device does not support
>>>>>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
>>>>>
>>>>> I am glad that famfs is mentioned here, it demonstrates you know about
>>>>> it. However, unfortunately this cover letter does not offer any analysis
>>>>> of *why* the Linux project should consider this additional approach to
>>>>> the inter-host shared-memory enabling problem.
>>>>>
>>>>> To be clear I am neutral at best on some of the initiatives around CXL
>>>>> memory sharing vs pooling, but famfs at least jettisons block-devices
>>>>> and gets closer to a purpose-built memory semantic.
>>>>>
>>>>> So my primary question is why would Linux need both famfs and cbd? I am
>>>>> sure famfs would love feedback and help vs developing competing efforts.
>>>>
>>>> Hi,
>>>> Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
>>>> shared memory, and related nodes can share the data inside this file system;
>>>> whereas cbd does not store data in shared memory, it uses shared memory as a
>>>> channel for data transmission, and the actual data is stored in the backend
>>>> block device of remote nodes. In cbd, shared memory works more like network
>>>> to connect different hosts.
>>>>
>>>
>>> Couldn't you basically just allocate a file for use as a uni-directional
>>> buffer on top of FAMFS and achieve the same thing without the need for
>>> additional kernel support? Similar in a sense to allocating a file on
>>> network storage and pinging the remote host when it's ready (except now
>>> it's fast!)
>>
>> I'm not entirely sure I follow your suggestion. I guess it means that cbd
>> would no longer directly manage the pmem device, but allocate files on famfs
>> to transfer data. I didn't do it this way because I considered at least a
>> few points: one of them is, cbd_transport actually requires a DAX device to
>> access shared memory, and cbd has very simple requirements for space
>> management, so there's no need to rely on a file system layer, which would
>> increase architectural complexity.
>>
>> However, we still need cbd_blkdev to provide a block device, so it doesn't
>> achieve "achieve the same without the need for additional kernel support".
>>
>> Could you please provide more specific details about your suggestion?
>
> Fundamentally you're shuffling bits from one place to another, the
> ultimate target is storage located on another device as opposed to
> the memory itself. So you're using CXL as a transport medium.
>
> Could you not do the same thing with a file in FAMFS, and put all of
> the transport logic in userland? Then you'd just have what looks like
> a kernel bypass transport mechanism built on top of a file backed by
> shared memory.
>
> Basically it's unclear to me why this must be done in the kernel.
> Performance? Explicit bypass? Some technical reason I'm missing?


In user space, transferring data via FAMFS files poses no problem, but
how do we present this data to users? We cannot expect users to revamp
all their business I/O methods.

For example, suppose a user needs to run a database on a compute node.
As the cloud infrastructure department, we need to allocate a block
storage on the storage node and provide it to the database on the
compute node through a certain transmission protocol (such as iSCSI,
NVMe over Fabrics, or our current solution, cbd). Users can then create
any file system they like on the block device and run the database on
it. We aim to enhance the performance of this block device with cbd,
rather than requiring the business department to adapt their database to
fit our shared memory-facing storage node disks.

This is why we need to provide users with a block device. If it were
only about data transmission, we wouldn't need a block device. But when
it comes to actually running business operations, we need a block
storage interface for the upper layer. Additionally, the block device
layer offers many other rich features, such as RAID.

If accessing shared memory in user space is mandatory, there's another
option: using user space block storage technologies like ublk. However,
this would lead to performance issues as data would need to traverse
back to the kernel space block device from the user space process.

In summary, we need a block device sharing mechanism, similar to what is
provided by NBD, iSCSI, or NVMe over Fabrics, because user businesses
rely on the block device interface and ecosystem.
>
>
> Also, on a tangential note, you're using pmem/qemu to emulate the
> behavior of shared CXL memory. You should probably explain the
> coherence implications of the system more explicitly.
>
> The emulated system implements what amounts to hardware-coherent
> memory (i.e. the two QEMU machines run on the same physical machine,
> so coherency is managed within the same coherence domain).
>
> If there is no explicit coherence control in software, then it is
> important to state that this system relies on hardware that implements
> snoop back-invalidate (which is not a requirement of a CXL 3.x device,
> just a feature described by the spec that may be implemented).

In (5) of the cover letter, I mentioned that cbd addresses cache
coherence at the software level:

(5) How do blkdev and backend interact through the channel?
a) For reader side, before reading the data, if the data in this
channel may be modified by the other party, then I need to flush the
cache before reading to ensure that I get the latest data. For example,
the blkdev needs to flush the cache before obtaining compr_head because
compr_head will be updated by the backend handler.
b) For writter side, if the written information will be read by others,
then after writing, I need to flush the cache to let the other party see
it immediately. For example, after blkdev submits cbd_se, it needs to
update cmd_head to let the handler have a new cbd_se. Therefore, after
updating cmd_head, I need to flush the cache to let the backend see it.


This part of the code is indeed implemented; however, as you pointed
out, since I am currently using qemu/pmem for emulation, the effects of
this code cannot be observed.

Thanx
>
> ~Gregory
> .
>
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>
>
> On 2024/4/26 9:48, Gregory Price wrote:
> >
> > Also, on a tangential note, you're using pmem/qemu to emulate the
> > behavior of shared CXL memory. You should probably explain the
> > coherence implications of the system more explicitly.
> >
> > The emulated system implements what amounts to hardware-coherent
> > memory (i.e. the two QEMU machines run on the same physical machine,
> > so coherency is managed within the same coherence domain).
> >
> > If there is no explicit coherence control in software, then it is
> > important to state that this system relies on hardware that implements
> > snoop back-invalidate (which is not a requirement of a CXL 3.x device,
> > just a feature described by the spec that may be implemented).
>
> In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> at the software level:
>
> (5) How do blkdev and backend interact through the channel?
> a) For reader side, before reading the data, if the data in this channel
> may be modified by the other party, then I need to flush the cache before
> reading to ensure that I get the latest data. For example, the blkdev needs
> to flush the cache before obtaining compr_head because compr_head will be
> updated by the backend handler.
> b) For writter side, if the written information will be read by others,
> then after writing, I need to flush the cache to let the other party see it
> immediately. For example, after blkdev submits cbd_se, it needs to update
> cmd_head to let the handler have a new cbd_se. Therefore, after updating
> cmd_head, I need to flush the cache to let the backend see it.
>

Flushing the cache is insufficient. All that cache flushing guarantees
is that the memory has left the writer's CPU cache. There are potentially
many write buffers between the CPU and the actual backing media that the
CPU has no visibility of and cannot pierce through to force a full
guaranteed flush back to the media.

for example:

memcpy(some_cacheline, data, 64);
mfence();

Will not guarantee that after mfence() completes that the remote host
will have visibility of the data. mfence() does not guarantee a full
flush back down to the device, it only guarantees it has been pushed out
of the CPU's cache.

similarly:

memcpy(some_cacheline, data, 64);
mfence();
memcpy(some_other_cacheline, data, 64);
mfence()

Will not guarantee that some_cacheline reaches the backing media prior
to some_other_cacheline, as there is no guarantee of write-ordering in
CXL controllers (with the exception of writes to the same cacheline).

So this statement:

> I need to flush the cache to let the other party see it immediately.

Is misleading. They will not see it "immediately", they will see it
"eventually at some completely unknowable time in the future".

~Gregory
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
On 2024/4/27 12:14, Gregory Price wrote:
> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>>
>>
>> On 2024/4/26 9:48, Gregory Price wrote:
>>>
>>
>> In (5) of the cover letter, I mentioned that cbd addresses cache coherence
>> at the software level:
>>
>> (5) How do blkdev and backend interact through the channel?
>> a) For reader side, before reading the data, if the data in this channel
>> may be modified by the other party, then I need to flush the cache before
>> reading to ensure that I get the latest data. For example, the blkdev needs
>> to flush the cache before obtaining compr_head because compr_head will be
>> updated by the backend handler.
>> b) For writter side, if the written information will be read by others,
>> then after writing, I need to flush the cache to let the other party see it
>> immediately. For example, after blkdev submits cbd_se, it needs to update
>> cmd_head to let the handler have a new cbd_se. Therefore, after updating
>> cmd_head, I need to flush the cache to let the backend see it.
>>
>
> Flushing the cache is insufficient. All that cache flushing guarantees
> is that the memory has left the writer's CPU cache. There are potentially
> many write buffers between the CPU and the actual backing media that the
> CPU has no visibility of and cannot pierce through to force a full
> guaranteed flush back to the media.
>
> for example:
>
> memcpy(some_cacheline, data, 64);
> mfence();
>
> Will not guarantee that after mfence() completes that the remote host
> will have visibility of the data. mfence() does not guarantee a full
> flush back down to the device, it only guarantees it has been pushed out
> of the CPU's cache.
>
> similarly:
>
> memcpy(some_cacheline, data, 64);
> mfence();
> memcpy(some_other_cacheline, data, 64);
> mfence()
>
> Will not guarantee that some_cacheline reaches the backing media prior
> to some_other_cacheline, as there is no guarantee of write-ordering in
> CXL controllers (with the exception of writes to the same cacheline).
>
> So this statement:
>
>> I need to flush the cache to let the other party see it immediately.
>
> Is misleading. They will not see is "immediately", they will see it
> "eventually at some completely unknowable time in the future".

This is indeed one of the issues I wanted to discuss at the RFC stage.
Thank you for pointing it out.

In my opinion, using "nvdimm_flush" might be one way to address this
issue, but it seems to flush the entire nd_region, which might be too
heavy. Moreover, it only applies to non-volatile memory.

This should be a general problem for cxl shared memory. In theory, FAMFS
should also encounter this issue.

Gregory, John, and Dan, Any suggestion about it?

Thanx a lot
>
> ~Gregory
>
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
On Sun, Apr 28, 2024 at 01:47:29PM +0800, Dongsheng Yang wrote:
>
>
> On 2024/4/27 12:14, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> > >
> > >
> > > On 2024/4/26 9:48, Gregory Price wrote:
> > > >
> > >
> > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> > > at the software level:
> > >
> > > (5) How do blkdev and backend interact through the channel?
> > > a) For reader side, before reading the data, if the data in this channel
> > > may be modified by the other party, then I need to flush the cache before
> > > reading to ensure that I get the latest data. For example, the blkdev needs
> > > to flush the cache before obtaining compr_head because compr_head will be
> > > updated by the backend handler.
> > > b) For writter side, if the written information will be read by others,
> > > then after writing, I need to flush the cache to let the other party see it
> > > immediately. For example, after blkdev submits cbd_se, it needs to update
> > > cmd_head to let the handler have a new cbd_se. Therefore, after updating
> > > cmd_head, I need to flush the cache to let the backend see it.
> > >
> >
> > Flushing the cache is insufficient. All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache. There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> >
> > for example:
> >
> > memcpy(some_cacheline, data, 64);
> > mfence();
> >
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data. mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> >
> > similarly:
> >
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence()
> >

just a derp here, meant to add an explicit clflush(some_cacheline)
between the copy and the mfence. But the result is the same.

> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> >
> > So this statement:
> >
> > > I need to flush the cache to let the other party see it immediately.
> >
> > Is misleading. They will not see is "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
>
> This is indeed one of the issues I wanted to discuss at the RFC stage. Thank
> you for pointing it out.
>
> In my opinion, using "nvdimm_flush" might be one way to address this issue,
> but it seems to flush the entire nd_region, which might be too heavy.
> Moreover, it only applies to non-volatile memory.
>

The problem is that the coherence domain really ends at the root
complex, and from the perspective of any one host the data is coherent.

Flushing only guarantees it gets pushed out from that domain, but does
not guarantee anything south of it.

Flushing semantics that don't puncture through the root complex won't
help

>
> This should be a general problem for cxl shared memory. In theory, FAMFS
> should also encounter this issue.
>
> Gregory, John, and Dan, Any suggestion about it?
>
> Thanx a lot
> >
> > ~Gregory
> >
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
On 24/04/28 01:47PM, Dongsheng Yang wrote:
>
>
> On 2024/4/27 12:14, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> > >
> > >
> > > On 2024/4/26 9:48, Gregory Price wrote:
> > > >
> > >
> > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> > > at the software level:
> > >
> > > (5) How do blkdev and backend interact through the channel?
> > > a) For reader side, before reading the data, if the data in this channel
> > > may be modified by the other party, then I need to flush the cache before
> > > reading to ensure that I get the latest data. For example, the blkdev needs
> > > to flush the cache before obtaining compr_head because compr_head will be
> > > updated by the backend handler.
> > > b) For writter side, if the written information will be read by others,
> > > then after writing, I need to flush the cache to let the other party see it
> > > immediately. For example, after blkdev submits cbd_se, it needs to update
> > > cmd_head to let the handler have a new cbd_se. Therefore, after updating
> > > cmd_head, I need to flush the cache to let the backend see it.
> > >
> >
> > Flushing the cache is insufficient. All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache. There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> >
> > for example:
> >
> > memcpy(some_cacheline, data, 64);
> > mfence();
> >
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data. mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> >
> > similarly:
> >
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence()
> >
> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> >
> > So this statement:
> >
> > > I need to flush the cache to let the other party see it immediately.
> >
> > Is misleading. They will not see is "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
>
> This is indeed one of the issues I wanted to discuss at the RFC stage. Thank
> you for pointing it out.
>
> In my opinion, using "nvdimm_flush" might be one way to address this issue,
> but it seems to flush the entire nd_region, which might be too heavy.
> Moreover, it only applies to non-volatile memory.
>
> This should be a general problem for cxl shared memory. In theory, FAMFS
> should also encounter this issue.
>
> Gregory, John, and Dan, Any suggestion about it?
>
> Thanx a lot
> >
> > ~Gregory
> >

Hi Dongsheng,

Gregory is right about the uncertainty around "clflush" operations, but
let me drill in a bit further.

Say you copy a payload into a "bucket" in a queue and then update an
index in a metadata structure; I'm thinking of the standard producer/
consumer queuing model here, with one index mutated by the producer and
the other mutated by the consumer.

(I have not reviewed your queueing code, but you *must* be using this
model - things like linked-lists won't work in shared memory without
shared locks/atomics.)

Normal logic says that you should clflush the payload before updating
the index, then update and clflush the index.

But we still observe in non-cache-coherent shared memory that the payload
may become valid *after* the clflush of the queue index.

The famfs user space has a program called pcq.c, which implements a
producer/consumer queue in a pair of famfs files. The only way to
currently guarantee a valid read of a payload is to use sequence numbers
and checksums on payloads. We do observe mismatches with actual shared
memory, and the recovery is to clflush and re-read the payload from the
client side. (Aside: These file pairs theoretically might work for CBD
queues.)
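
Roughly, the consumer-side check looks like this (a sketch of the idea only, not the actual pcq.c code; the layout and names here are made up):

#include <stdint.h>
#include <stddef.h>

struct pcq_entry {                      /* hypothetical payload layout */
        uint64_t seq;                   /* producer's sequence number */
        uint32_t csum;                  /* checksum over payload[] */
        uint8_t  payload[4096 - 12];
};

static uint32_t pcq_csum(const uint8_t *p, size_t len)
{
        uint32_t sum = 0;

        while (len--)
                sum = sum * 31 + *p++;
        return sum;
}

/* returns 0 if the entry is consistent; nonzero means the consumer
 * must clflush the entry and re-read it */
static int pcq_entry_valid(const struct pcq_entry *e, uint64_t expected_seq)
{
        if (e->seq != expected_seq)
                return -1;      /* producer has not published this slot yet */
        if (pcq_csum(e->payload, sizeof(e->payload)) != e->csum)
                return -1;      /* stale/torn read: flush and retry */
        return 0;
}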

Another side note: it would be super-helpful if the CPU gave us an explicit
invalidate rather than just clflush, which will write-back before
invalidating *if* the cache line is marked as dirty, even when software
knows this should not happen.

Note that CXL 3.1 provides a way to guarantee that stuff that should not
be written back can't be written back: read-only mappings. This is one of
the features I got into the spec; using this requires CXL 3.1 DCD, and
would require two DCD allocations (i.e. two tagged-capacity dax devices -
one writable by the server and one by the client).

Just to make things slightly gnarlier, the MESI cache coherency protocol
allows a CPU to speculatively convert a line from exclusive to modified,
meaning it's not clear as of now whether "occasional" clean write-backs
can be avoided. Meaning those read-only mappings may be more important
than one might think. (Clean write-backs basically make it
impossible for software to manage cache coherency.)

Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
shared memory is not explicitly legal in cxl 2, so there are things a cpu
could do (or not do) in a cxl 2 environment that are not illegal because
they should not be observable in a no-shared-memory environment.

CBD is interesting work, though for some of the reasons above I'm somewhat
skeptical of shared memory as an IPC mechanism.

Regards,
John
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
Dongsheng Yang wrote:
>
>
> On 2024/4/25 2:08, Dan Williams wrote:
> > Dongsheng Yang wrote:
> >>
> >>
> >> On 2024/4/24 12:29, Dan Williams wrote:
> >>> Dongsheng Yang wrote:
> >>>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> >>>>
> >>>> Hi all,
> >>>> This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> >>>> https://github.com/DataTravelGuide/linux
> >>>>
> >>> [..]
> >>>> (4) dax is not supported yet:
> >>>> same with famfs, dax device is not supported here, because dax device does not support
> >>>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
> >>>
> >>> I am glad that famfs is mentioned here, it demonstrates you know about
> >>> it. However, unfortunately this cover letter does not offer any analysis
> >>> of *why* the Linux project should consider this additional approach to
> >>> the inter-host shared-memory enabling problem.
> >>>
> >>> To be clear I am neutral at best on some of the initiatives around CXL
> >>> memory sharing vs pooling, but famfs at least jettisons block-devices
> >>> and gets closer to a purpose-built memory semantic.
> >>>
> >>> So my primary question is why would Linux need both famfs and cbd? I am
> >>> sure famfs would love feedback and help vs developing competing efforts.
> >>
> >> Hi,
> >> Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
> >> shared memory, and related nodes can share the data inside this file
> >> system; whereas cbd does not store data in shared memory, it uses shared
> >> memory as a channel for data transmission, and the actual data is stored
> >> in the backend block device of remote nodes. In cbd, shared memory works
> >> more like network to connect different hosts.
> >>
> >> That is to say, in my view, FAMfs and cbd do not conflict at all; they
> >> meet different scenario requirements. cbd simply uses shared memory to
> >> transmit data, shared memory plays the role of a data transmission
> >> channel, while in FAMfs, shared memory serves as a data store role.
> >
> > If shared memory is just a communication transport then a block-device
> > abstraction does not seem a proper fit. From the above description this
> > sounds similar to what CONFIG_NTB_TRANSPORT offers which is a way for
> > two hosts to communicate over a shared memory channel.
> >
> > So, I am not really looking for an analysis of famfs vs CBD I am looking
> > for CBD to clarify why Linux should consider it, and why the
> > architecture is fit for purpose.
>
> Let me explain why we need cbd:
>
> In cloud storage scenarios, we often need to expose block devices of
> storage nodes to compute nodes. We have options like nbd, iscsi, nvmeof,
> etc., but these all communicate over the network. cbd aims to address
> the same scenario but using shared memory for data transfer instead of
> the network, aiming for better performance and reduced network latency.
>
> Furthermore, shared memory can not only transfer data but also implement
> features like write-ahead logging (WAL) or read/write cache, further
> improving performance, especially latency-sensitive business scenarios.
> (If I understand correctly, this might not be achievable with the
> previously mentioned ntb.)
>
> To ensure we have a common understanding, I'd like to clarify one point:
> the /dev/cbdX block device is not an abstraction of shared memory; it is
> a mapping of a block device (such as /dev/sda) on the remote host.
> Reading/writing to /dev/cbdX is equivalent to reading/writing to
> /dev/sda on the remote host.
>
> This is the design intention of cbd. I hope this clarifies things.

It does, thanks for the clarification. Let me go back and take another
look now that I understand that this is a "remote storage target over CXL
memory" solution.
Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
Dongsheng Yang wrote:
>
>
> On 2024/4/27 12:14, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> >>
> >>
> >> On 2024/4/26 9:48, Gregory Price wrote:
> >>>
> >>
> >> In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> >> at the software level:
> >>
> >> (5) How do blkdev and backend interact through the channel?
> >> a) For reader side, before reading the data, if the data in this channel
> >> may be modified by the other party, then I need to flush the cache before
> >> reading to ensure that I get the latest data. For example, the blkdev needs
> >> to flush the cache before obtaining compr_head because compr_head will be
> >> updated by the backend handler.
> >> b) For writter side, if the written information will be read by others,
> >> then after writing, I need to flush the cache to let the other party see it
> >> immediately. For example, after blkdev submits cbd_se, it needs to update
> >> cmd_head to let the handler have a new cbd_se. Therefore, after updating
> >> cmd_head, I need to flush the cache to let the backend see it.
> >>
> >
> > Flushing the cache is insufficient. All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache. There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> >
> > for example:
> >
> > memcpy(some_cacheline, data, 64);
> > mfence();
> >
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data. mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> >
> > similarly:
> >
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence()
> >
> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> >
> > So this statement:
> >
> >> I need to flush the cache to let the other party see it immediately.
> >
> > Is misleading. They will not see is "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
>
> This is indeed one of the issues I wanted to discuss at the RFC stage.
> Thank you for pointing it out.
>
> In my opinion, using "nvdimm_flush" might be one way to address this
> issue, but it seems to flush the entire nd_region, which might be too
> heavy. Moreover, it only applies to non-volatile memory.
>
> This should be a general problem for cxl shared memory. In theory, FAMFS
> should also encounter this issue.
>
> Gregory, John, and Dan, Any suggestion about it?

The CXL equivalent is GPF (Global Persistence Flush), not to be confused
with "General Protection Fault" which is likely what will happen if
software needs to manage cache coherency for this solution. CXL GPF was
not designed to be triggered by software. It is a hardware response to a
power supply indicating loss of input power.

I do not think you want to spend community resources reviewing software
cache coherency considerations, and instead "just" mandate that this
solution requires inter-host hardware cache coherence. I understand that
is a difficult requirement to mandate, but it is likely less difficult
than getting Linux to carry a software cache coherence mitigation.

In some ways this reminds me of SMR drives and the problems those posed
to software where ultimately the programming difficulties needed to be
solved in hardware, not exported to the Linux kernel to solve.