Mailing List Archive

A memory map based data persistence and startup speedup approach
Hi folks, as illustrated in faster-cpython#150 [1], we have implemented a mechanism that supports data persistence of a subset of Python data types with mmap, which can reduce package import time by caching code objects. This can be seen as a more eager pyc format, since both serve the same purpose, but our approach tries to avoid [de]serialization. As a result, we see a ~15% speedup in overall Python startup.
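For contrast, a standard .pyc round-trips code objects through marshal; the sketch below (a minimal illustration, not pycds code) shows the [de]serialization step that the mmap approach is designed to skip.

```python
# A .pyc is essentially a small header plus marshal-serialized code objects.
# Loading one therefore deserializes on every cold import; mapping the
# objects directly from a file avoids this step entirely.
import marshal

code = compile("x = 1 + 2", "<demo>", "exec")
blob = marshal.dumps(code)       # what pyc writing does (once)
restored = marshal.loads(blob)   # what pyc loading does (every cold import)

ns = {}
exec(restored, ns)
assert ns["x"] == 3
```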

Currently, we’ve packaged it as a third-party library and have been working on open-sourcing it.

Our implementation (unofficial name: “pycds”) mainly contains two parts:
- importlib hooks: a mechanism to dump code objects to an archive, plus a `Finder` that supports loading code objects from mapped memory.
- Dumping and loading a (subset of) Python types with mmap. In this part, we deal with 1) ASLR, by patching `ob_type` fields; 2) hash seed randomization, by supporting only basic types whose layout is not hash-based (i.e. dict is not supported); 3) interned strings, by re-interning strings while loading the mmap archive; and so on.
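The importlib-hook half can be pictured with a toy finder. This is only an illustrative sketch: a plain dict stands in for the mapped archive, and the names `ArchiveFinder` and `demo_cached` are made up for the example, not part of pycds.

```python
# Hypothetical sketch of a meta-path finder that serves modules whose code
# objects are already available in memory, bypassing filesystem search,
# compilation, and unmarshalling.
import sys
import importlib.abc
import importlib.util

class ArchiveFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serve code objects from an in-memory archive (dict stands in for mmap)."""

    def __init__(self, archive):
        self._archive = archive  # fullname -> code object

    def find_spec(self, fullname, path=None, target=None):
        if fullname not in self._archive:
            return None  # fall through to the regular finders
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None  # use default module creation

    def exec_module(self, module):
        exec(self._archive[module.__name__], module.__dict__)

# Usage: populate the "archive" and install the finder ahead of the defaults.
archive = {"demo_cached": compile("VALUE = 42", "<archive>", "exec")}
sys.meta_path.insert(0, ArchiveFinder(archive))

import demo_cached
print(demo_cached.VALUE)  # 42
```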

After pycds has been installed, the complete workflow consists of three steps:
1. Record the names of imported packages to heap.lst: `PYCDSMODE=TRACE PYCDSLIST=heap.lst python run.py`
2. Dump a memory archive of the code objects of the imported packages (this step does not involve the Python script): `PYCDSMODE=DUMP PYCDSLIST=heap.lst PYCDSARCHIVE=heap.img python`
3. Run other Python processes with the created archive: `PYCDSMODE=SHARE PYCDSARCHIVE=heap.img python run.py`
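As a sketch, the three steps above could be driven from Python via subprocess (assuming pycds is installed and a `run.py` exists; the authoritative invocations are the shell commands shown above):

```python
# Hypothetical driver for the three pycds phases; the environment-variable
# names come from the workflow above, the helper itself is illustrative.
import os
import subprocess

def run_phase(extra_env, argv):
    env = {**os.environ, **extra_env}
    return subprocess.run(argv, env=env, check=True)

trace = {"PYCDSMODE": "TRACE", "PYCDSLIST": "heap.lst"}
dump  = {"PYCDSMODE": "DUMP",  "PYCDSLIST": "heap.lst", "PYCDSARCHIVE": "heap.img"}
share = {"PYCDSMODE": "SHARE", "PYCDSARCHIVE": "heap.img"}

# run_phase(trace, ["python", "run.py"])   # 1) record imported packages
# run_phase(dump,  ["python"])             # 2) dump the archive (no script needed)
# run_phase(share, ["python", "run.py"])   # 3) run with the shared archive
```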

We could even make use of immortal objects if PEP 683 [2] were accepted, which could give CDS further performance improvements. Currently, every archived object is effectively immortal: we increment the refcount of each object copied to the archive so that it is never deallocated. However, without changes to CPython, the refcount fields of archived objects are still updated, which incurs extra memory footprint due to copy-on-write (CoW).
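The footprint issue stems from CPython writing to an object's `ob_refcnt` field whenever a reference is created or dropped, even if the object is never otherwise mutated. A small illustration using `sys.getrefcount` (standard library only, nothing pycds-specific):

```python
import sys

obj = object()  # stands in for an object copied into a shared archive
before = sys.getrefcount(obj)
refs = [obj] * 5          # taking references writes to obj's ob_refcnt field
after = sys.getrefcount(obj)
assert after == before + 5

# For an object living on an mmap'd page, that refcount write alone forces
# the OS to copy the whole page (copy-on-write), even though the object's
# payload never changed. Fixed-refcount immortal objects would avoid this.
```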

More background and implementation details can be found at [1].
We think this could be an effective way to improve Python’s startup performance, and it could even do more, such as sharing large data between Python instances.
As suggested on python-ideas [3], we are posting this here, looking for questions and suggestions on the overall design and workflow. We also welcome code reviews once we get our lawyers happy and can publish the code.

Best,
Yichen Yan
Alibaba Compiler Group

[1] “Faster startup -- Share code objects from memory-mapped file”, https://github.com/faster-cpython/ideas/discussions/150
[2] PEP 683: "Immortal Objects, Using a Fixed Refcount" (draft), https://mail.python.org/archives/list/python-dev@python.org/message/TPLEYDCXFQ4AMTW6F6OQFINSIFYBRFCR/
[3] [Python-ideas] "A memory map based data persistence and startup speedup approach", https://mail.python.org/archives/list/python-ideas@python.org/thread/UKEBNHXYC3NPX36NS76LQZZYLRA4RVEJ/
Re: A memory map based data persistence and startup speedup approach
(belated follow-up as I noticed there hadn't been a reply on list yet, just
the previous feedback on the faster-cpython ticket)

On Mon, 21 Feb 2022, 6:53 pm Yichen Yan via Python-Dev, <
python-dev@python.org> wrote:

>
> Hi folks, as illustrated in faster-cpython#150 [1], we have implemented a
> mechanism that supports data persistence of a subset of Python data types
> with mmap, which can reduce package import time by caching code objects.
> This can be seen as a more eager pyc format, since both serve the same
> purpose, but our approach tries to avoid [de]serialization. As a result, we
> see a ~15% speedup in overall Python startup.
>

This certainly sounds interesting!


> Currently, we’ve made it a third-party library and have been working on
> open-sourcing.
>
>
> Our implementation (whose non-official name is “pycds”) mainly contains
> two parts:
>
> - importlib hooks, this implements the mechanism to dump code objects
> to an archive and a `Finder` that supports loading code object from mapped
> memory.
> - Dumping and loading (subset of) python types with mmap. In this
> part, we deal with 1) ASLR by patching `ob_type` fields; 2) hash seed
> randomization by supporting only basic types who don’t have hash-based
> layout (i.e. dict is not supported); 3) interned string by re-interning
> strings while loading mmap archive and so on.
>
I assume the files wouldn't be portable across architectures, so does the
cache file naming scheme take that into account?

(The idea is interesting regardless of whether it produces arch-specific
files - kind of a middle ground between portable serialisation based pycs
and fully frozen modules)

Cheers,
Nick.


Re: A memory map based data persistence and startup speedup approach
Hi Nick,

Sorry for the late reply, and thanks for the feedback!

We’ve been working on publishing the package, and the first version is available at https://github.com/alibaba/code-data-share-for-python/, with a user guide and some statistics (TL;DR: ~15% startup speedup).
We welcome code reviews, comments, and questions.

> I assume the files wouldn't be portable across architectures

That’s right: the file is basically a snapshot of part of the CPython heap that can be shared between processes.

> so does the cache file naming scheme take that into account?

Currently no. The file is intended to be generated on demand (rather than as one huge archive built from all installed third-party packages), so the file itself and its name are meant to be managed by the user.

> (The idea is interesting regardless of whether it produces arch-specific files - kind of a middle ground between portable serialisation based pycs and fully frozen modules)

I think our package could serve as a substitute for the frozen-module mechanism for third-party packages: while built-in modules can be compiled to C code, code-data-share can automatically create a similar file that requires no compilation or deserialization.
Actually, we do have a POC that is integrated with CPython and can speed up importing built-in modules, but after making it a third-party package there’s not much we can do for the builtins, so freeze and deep-freeze are quite exciting to us.

Best,
Yichen

> On Mar 20, 2022, at 23:26, Nick Coghlan <ncoghlan@gmail.com> wrote: