
digest reorganization and enhancements
Hi,

This mail was first sent to dev-portage so it's written for that
audience, but it should be understandable for normal devs as well ;)

Short summary: current portage versions won't be able to handle any
modification to the digest format so we have to find a different way if
we want support for SHA1 or other algorithms.


And now the more detailed mail:

As was discussed again on -dev recently, we need more digest algorithms
for file verification. One way that would be halfway compatible would be
to add additional lines using the same syntax as for the current md5
checksums to the digests and Manifests. However, that means a lot of
redundancy, as for each additional algorithm the filename and filesize
would be duplicated. It's also not trivial to do, as there are several
functions dealing with digests and they all parse them a bit differently
(I tried to add SHA1 support for digests and Manifests, took me about an
hour before I gave up). Also, as soon as we add non-MD5 lines to digests
all currently released portage versions will blow up (as they will treat
the provided hash as an MD5 value, call it a bug if you want).

Instead I suggest we completely reorganize the digest system from
scratch by unifying the digests and the manifest files. As you all know
our tree is getting bigger and bigger with no end in sight. That
combined with the usual filesystem overhead causes a lot of wasted space
on many systems. By unifying the digests with the Manifests we could
kill >15,000 very small files at once (in the long run, this would
require compatible portage versions for all users).

As for the new syntax, it should allow us to add new digest algorithms
to portage without changing the syntax. My current idea would be that
for each file in the tree and in SRC_URI we have a line specifying:
- the filename
- the filesize
- n digests (consisting of algorithmname and the checksum)
To maintain compatibility and support future enhancements, each of these
lines has to be prefixed with a (set of) keyword(s) (FILE or DIGEST or
SRC_URI, EBUILD, AUXFILE).
Example lines could be:

SRC_URI portage-2.0.51_rc7.tar.bz2 274572 MD5 1234 SHA1 abcd RMD160 9876
EBUILD portage-2.0.51_rc7.ebuild 11806 MD5 xyz SHA1 fifteen

(using fake checksums for readability).
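
A minimal sketch of how such a line could be parsed, assuming
whitespace-separated fields and filenames without spaces. The names
ManifestEntry and parse_entry are made up for illustration and are not
existing portage code:

```python
from typing import Dict, NamedTuple


class ManifestEntry(NamedTuple):
    kind: str                # keyword prefix, e.g. SRC_URI or EBUILD
    name: str                # filename
    size: int                # filesize in bytes
    digests: Dict[str, str]  # algorithm name -> checksum


def parse_entry(line: str) -> ManifestEntry:
    """Parse one line of the proposed format (hypothetical helper)."""
    fields = line.split()
    # Keyword, filename, filesize, then at least one (algorithm, checksum) pair.
    if len(fields) < 5 or (len(fields) - 3) % 2 != 0:
        raise ValueError("malformed entry: %r" % line)
    kind, name, size = fields[0], fields[1], int(fields[2])
    pairs = fields[3:]
    # Even positions are algorithm names, odd positions are their checksums.
    digests = dict(zip(pairs[0::2], pairs[1::2]))
    return ManifestEntry(kind, name, size, digests)


entry = parse_entry(
    "SRC_URI portage-2.0.51_rc7.tar.bz2 274572 MD5 1234 SHA1 abcd RMD160 9876"
)
```

Because the digests are read as name/checksum pairs, a parser like this
does not need to change when another algorithm is appended to a line.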

Maybe the system can also be extended to incorporate GLEP 25 without
adding a ton of new files, I'd need some input from Brian on that issue.

The biggest problem for this proposal is of course compatibility. A rough
transition plan could be:
- keep digests as they are now
- add the new format to Manifests (in addition to the current MD5 lines)
- support the new format in 2.0.52 (use it optionally for verification)
- use it for verification in 2.1 by default (and drop support for the
old system)
- exclude the old digests from `emerge --sync` in 2.1
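
As an illustration of the verification step in such a transition, here
is a rough sketch that checks the size and every digest the running
version knows about, and skips algorithm names it does not know. The
KNOWN_ALGORITHMS table and the verify function are assumptions for this
example, not portage's implementation:

```python
import hashlib
from typing import Dict

# Assumed mapping from names used in the new-style lines to hashlib names.
KNOWN_ALGORITHMS = {"MD5": "md5", "SHA1": "sha1"}


def verify(data: bytes, expected_size: int, digests: Dict[str, str]) -> bool:
    """Return True if the size and all recognised digests match (sketch)."""
    if len(data) != expected_size:
        return False
    checked = 0
    for algorithm, expected in digests.items():
        hash_name = KNOWN_ALGORITHMS.get(algorithm)
        if hash_name is None:
            # Unknown (possibly newer) algorithm: skip it instead of failing.
            continue
        if hashlib.new(hash_name, data).hexdigest() != expected.lower():
            return False
        checked += 1
    # Require at least one digest that could actually be checked.
    return checked > 0
```

Skipping unknown algorithm names is one possible policy; it lets an
older version keep working when new digests show up in the tree, at the
cost of only checking the algorithms it understands.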

And finally a summarizing list of reasons for the format:
- keep all checksums of a package in one place
- removes one level of indirection for signing
- digest generation currently recreates the Manifest anyway
- removing files from the tree
- allows for easy addition of new digest algorithms
- any syntax modification to the current digest files brings compatibility
problems with all currently existing portage versions while Manifest
changes do not
- potential to discover file collisions easier (currently you can have
the same file in two digests with different checksums, not a real
problem yet though)
- removes redundancy for common files

Let the discussions begin.

Marius
Re: digest reorganization and enhancements
This entire issue changes direction based on how hard and from
where the pushes come. There are 3 general directions, as I see
it.

1. Unify Manifest & digests.
2. Create a new digest file.
3. Break backwards compatibility.

I'd prefer 3. I'm pretty sure it's the first time I've
ever suggested that or had it happen since I started
maintaining portage. :-/

> SRC_URI portage-2.0.51_rc7.tar.bz2 274572 MD5 1234 SHA1 abcd RMD160 9876
> EBUILD portage-2.0.51_rc7.ebuild 11806 MD5 xyz SHA1 fifteen
>
> (using fake checksums for readability).

My suggestion involves a break in compatibility after an expedited
transition period. Kind of a bummer really.

A primary issue with unified digest/Manifest is that you force
Maintainers to verify tarballs and be liable for them. With the
separated digest/Manifest we can begin signing both the Manifest
and the digests allowing the introducer of the package to be
liable for the tarballs and leave the Manifest, Ebuilds, and
supporting files to be verified by the {arch,package} Maintainer.

With signatures moving along slowly, we have the capability
to revoke a signature and all potentially infected packages.
If Arch maintainers are constantly removing/changing the
signatures on a digest, you will have no obvious knowledge
of which of them have touched a particular file.

> Maybe the system can also be extended to incorporate
> GLEP 25 without adding a ton of new files, I'd need some
> input from Brian on that issue.

I'd rather this be an external sync tool than have portage
need to deal with it. Supporting it via an uncompressed MD5
is easily done though. We're breaking compat with my suggestion
anyway.

> And finally a summarizing list of reasons for the format:
> - keep all checksums of a package in one place
> - removes one level of indirection for signing

These are good and bad depending on how you look at it.

> - digest generation currently recreates the Manifest anyway
> - removing files from the tree

You lose direct info on who originally verified the tarballs.

> - allows for easy addition of new digest algorithms
> - any syntax modification to the current digest files brings
> compatibility problems with all currently existing portage
> versions while Manifest changes do not

Sadly, I missed fixing digests when I did the Manifest code.

The fix is simply a matter of time. But from any realistic,
time-constrained viewpoint, it is accurate. It's a one-time
adjustment of the digest format.

SHA1 and many others can be added to the digests after 2.0.51
comes out. It supports digests of varying forms but there
should be a reasonable delay before implementation as the
impact on the slow-to-update user will be manual. A script
could be provided, I imagine. The rescue documentation still
applies.

> - potential to discover file collisions easier (currently
> you can have the same file in two digests with different
> checksums, not a real problem yet though)

Tools exist for detecting this. md5_check in bin/.
They should probably be integrated into repoman.
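
For readers unfamiliar with the kind of check meant here, a small
sketch of collision detection follows. It is not the md5_check script
(its code is not shown in this thread); find_conflicts and the
(package, filename, checksum) record layout are assumptions made for
the example:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_conflicts(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Report filenames listed with more than one distinct checksum.

    records holds (package, filename, checksum) tuples gathered from
    the digest files of several packages (hypothetical layout).
    """
    seen = defaultdict(lambda: defaultdict(set))
    for package, filename, checksum in records:
        # Compare checksums case-insensitively.
        seen[filename][checksum.lower()].add(package)
    return {
        filename: {c: sorted(pkgs) for c, pkgs in by_checksum.items()}
        for filename, by_checksum in seen.items()
        if len(by_checksum) > 1
    }


# Fake package names and checksums, for readability only.
records = [
    ("app-a/foo", "shared.tar.gz", "aaaa"),
    ("app-b/bar", "shared.tar.gz", "bbbb"),
    ("app-c/baz", "other.tar.gz", "cccc"),
    ("app-d/qux", "other.tar.gz", "CCCC"),
]
conflicts = find_conflicts(records)
```

Only shared.tar.gz is reported here, since the two other.tar.gz
records differ only in letter case.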


So...
The overall scheme:

1. Transition quickly into 2.0.51.
2. Announce the strong-recommendation/need to update.
3. Post new rescue tarballs.
4. Wait 2-4 weeks.
5. Announce again... ?
6. Enable portage's SHA1 creation.
7. Break compat in the tree.
8. Hopefully never do that again.

--NJ
re: digest reorganization and enhancements (from 2004)
It seems like compatibility could be maintained by managing the
transition like this:

(1) new repoman writes both the digests and new-style manifest.
Of course, this is redundant, but that's ok... it's all about
transition.

(2) new portage uses rsync exclusions to ignore the digests since
it only needs the new-style manifest:
rsync --exclude '**/files/digest*' --delete-excluded

(3) users get immediate benefit. we can keep running in this mode
as long as necessary before completely deprecating the digest
files

--
Aron Griffis
Gentoo Linux Developer
re: digest reorganization and enhancements (from 2004)
Aron Griffis posted <20050707141954.GA20052@kaf.zko.hp.com>, excerpted
below, on Thu, 07 Jul 2005 10:19:54 -0400:

> (2) new portage uses rsync exclusions to ignore the digests since
> it only needs the new-style manifest:
> rsync --exclude '**/files/digest*' --delete-excluded

This wouldn't delete locally rsync-excluded files/dirs, I hope. Rather
than use the default package and source dirs, I have them as pkg and src,
rsync-excluded so they don't get killed. I'd be very unhappy if they got
killed anyway!

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html


--
gentoo-dev@gentoo.org mailing list