Mailing List Archive: an efficient idea for an alternative portage synchronisation

tl;dr - i'm suggesting a new file syncing protocol
for portage syncing. details are in section 2.

1. background
-------------
rsync needs to examine every file in order to
compare them. this is too expensive and doesn't
scale as portage's tree grows in size.

on the other hand, git avoids this by
maintaining a history of edits. so git doesn't
need to compare all files; instead it walks
through the history.

but git has another issue: the history gets
too big. this causes:
- `git clone` to needlessly take too long, as
  many old histories become irrelevant once
  they are fully overwritten by newer ones.
- `git pull` to be slower than needed, as the
  history is not ideally compressed.
- disk space to be wasted on histories.

2. new protocol
---------------
to solve the issues above, i think the ideal
solution is this protocol:
- each history is a number representing a
  logical clock. 1st history is 0, 2nd is 1,
  etc.
- the server maintains a list of the past N
  histories of the portage tree.
- when a client requests to update its portage
  tree, it tells the server its current
  history. e.g. say the client is currently
  at logical time 1234567.
- the server is maintaining only the past N
  histories:
    - if 1234567 is older than those N
      histories, then the server sends a full
      portage tree from scratch.
    - if 1234567 is within those N histories,
      then the server has two options:
        (1) either send all changes since
            1234567, as they happened
            historically. this is a bad idea.
            no good reason for it.

        (2) better: the server can send the
            compressed histories. compressed
            histories are computed once, and
            cached, in a scalable way. the
            cache itself is incremental, so
            updating the cache is cheap
            (details in section 2.2).

            e.g. if there are 5000 histories
            that the client lacks since time
            1234567, then there is a chance
            that many of the changes are just
            a waste of time. e.g. add a file,
            then delete the same file, then
            add a different file again. so
            why not just lie about the
            history, and send only the last
            file, skipping the ones in the
            middle? the same applies to diffs
            to code blocks.
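
a minimal sketch of the server side described above,
in python. this is only an illustration under my own
assumptions, not a finished design: the names
(`Server`, `commit`, `sync`, `squash`) are made up,
the tree is an in-memory dict of path -> content, and
a deletion is represented as `None`.

```python
from typing import Dict, List, Optional, Tuple

# A change maps a file path to its new content, or to None for a deletion.
Change = Dict[str, Optional[str]]


def squash(changes: List[Change]) -> Change:
    """Collapse consecutive changes, keeping only the final state per path."""
    merged: Change = {}
    for change in changes:
        merged.update(change)  # later histories override earlier ones
    return merged


class Server:
    """Hypothetical server that keeps only the past `n` histories."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.latest = 0                      # logical clock of the newest tree
        self.tree: Dict[str, str] = {}       # full tree at time `latest`
        self.log: Dict[int, Change] = {}     # history number -> change

    def commit(self, change: Change) -> None:
        """Record a new history and forget the one that falls out of the window."""
        self.latest += 1
        self.log[self.latest] = change
        for path, content in change.items():
            if content is None:
                self.tree.pop(path, None)
            else:
                self.tree[path] = content
        self.log.pop(self.latest - self.n, None)

    def sync(self, client_time: int) -> Tuple[str, Dict[str, Optional[str]]]:
        """Answer a client that reports its current logical time."""
        # Too old (or an impossible future time): send the full tree.
        if client_time < self.latest - self.n or client_time > self.latest:
            return ("full", dict(self.tree))
        missing = [self.log[t] for t in range(client_time + 1, self.latest + 1)]
        return ("delta", squash(missing))


server = Server(n=3)
server.commit({"a": "1"})     # time 1: add a
server.commit({"a": None})    # time 2: delete a
server.commit({"b": "2"})     # time 3: add b
print(server.sync(0))         # one squashed delta instead of three changes
```

note that the squashed delta may still tell a client
to delete a path it never had (like `a` above), which
is harmless for the client to ignore.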

2.1. properties of this new protocol
------------------------------------
so this new protocol has these properties:
- unlike rsync, it doesn't need to compare all
  files individually.
- unlike git, the history doesn't grow on the
  client. the client's history remains only a
  single number representing a logical clock.
- the history on the server is limited to the
  past N entries. no devs will cry, because
  this is not a code collaboration app, but
  simply a file synchronisation app to replace
  rsync. so the admins are free to set N as
  small as they please, without worrying about
  harming collaborating devs.
- the server has the option to send compressed
  histories to clients, and these histories
  are cacheable for more performance.

2.2. how it will feel to admins/devs
------------------------------------
- the devs simply commit their changes to the
  portage tree via git.
- the git server will have hooks to execute an
  external command for this new protocol, that
  will calculate all diffs necessary in order
  to build a new history.

  e.g. if the current history is 30000, and a
  dev makes a new commit via git, then the git
  hooks will execute the external command to
  calculate the diff for the files affected by
  the git commit, such that history 30001 is
  created.

  the hooked external command will also see if
  it can compress the histories for the past
  M entries up to 30001, so that clients that
  live in time 30001-M, who ask for 30001, can
  get the compressed history instead of the
  raw actual histories from 30001-M to 30001.
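
one possible way to keep the cache incremental,
sketched in python. this is an assumption on my part
rather than a settled design: `DeltaCache`,
`on_commit` and `delta_for` are made-up names, the
actual git hook wiring is left out, and each commit
is assumed to arrive as a dict of path -> content
(`None` for a deletion). every new history only
touches the M cached deltas, so the per-commit cost
is about M times the size of the commit.

```python
from typing import Dict, Optional

Change = Dict[str, Optional[str]]


class DeltaCache:
    """Hypothetical cache of squashed deltas for the past `m` client times."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.latest = 0
        # client time -> squashed delta from that time up to `latest`
        self.deltas: Dict[int, Change] = {}

    def on_commit(self, change: Change) -> None:
        """Called once per new history, e.g. by the command run from the git hook."""
        for delta in self.deltas.values():
            delta.update(change)              # fold the new change into each delta
        self.deltas[self.latest] = dict(change)
        self.latest += 1
        self.deltas.pop(self.latest - self.m - 1, None)  # drop the oldest entry

    def delta_for(self, client_time: int) -> Optional[Change]:
        """Return the cached delta, or None if the client needs a full tree."""
        return self.deltas.get(client_time)


cache = DeltaCache(m=2)
cache.on_commit({"a": "1"})
cache.on_commit({"a": None, "b": "2"})
cache.on_commit({"b": "3"})
print(cache.delta_for(1))   # squashed delta from time 1 to time 3
print(cache.delta_for(0))   # None: time 0 fell out of the window
```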

ty,
cm.

On Fri, Jun 18, 2021, 07:10 caveman <toraboracaveman@protonmail.com> wrote:

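It seems like you are almost asking for shallow git syncing, which
portage already supports through the clone-depth and sync-depth
settings in repos.conf (built on git's --depth option).

It's not an exact match for your proposal but it's very close.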
