Mailing List Archive

Change layout of distfiles
Hi,

as suggested by Mike in http://bugs.gentoo.org/show_bug.cgi?id=123335,
here's my proposal for changing the layout of the distfiles tree:


This is the current state:

mirror:/storage/gentoo/data/source/distfiles# ls | wc -l
22543
mirror:/storage/gentoo/data/source/distfiles# ls -l ../ | grep distfiles
drwxr-xr-x 3 gentoo gentoo 950272 Mar 6 06:08 distfiles
mirror:/storage/gentoo/data/source/distfiles#


People who want to browse the files "by hand" are usually stumped by the
large output (the directory listing lighttpd creates is currently 4.2MB
in size) and generation of those listings causes an excessive strain on
the server.

Plus, the creation/deletion of files doesn't scale too well on
filesystems that store directory entries in linked lists (ext2/3,
probably the common BSD filesystems), since the list has to be traversed
for each file created or deleted.

Introducing an additional directory hierarchy should fix this, and is
the common solution to this problem for various projects, be it Debian
[1], CPAN [2], Slackware [3], etc.


One migration scenario for a better future:

Create subdirectories named after the first letter of each file and move
the files into their respective directories.

Either sym- or hardlink the files from the current distfiles root
directory to the directory in which they now reside. (Depending on the
chosen link type, check with the mirror admins first whether rsync
hardlink support is enabled or whether their web/ftp servers
allow/follow symlinks.)

Adapt the build scripts so that they look for the files in their new
location.

Change the scripts which fetch the files for distfiles so that they save
them under the new location.

Wait a few weeks... (months? years? decades?) until the last user has
updated and/or a clean upgrade path exists that doesn't rely on the
old file locations.

Drop the sym/hardlinks.


After the change we'd have (with the current set of files) 63
subdirectories, the largest one containing 1775 files (letter 'g'),
which is a definite improvement over the current situation.

Full list can be seen at http://mirror.inode.at/gentoo-listing.txt .
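
To reproduce per-letter counts like these on a mirror, something along
the lines of the following sketch might do. The function name
count_by_first_char is made up for illustration; it assumes bash and is
case-sensitive, so 'G' and 'g' are tallied separately.

```shell
# Hypothetical helper: tally top-level regular files in a directory by
# the first character of their name, largest bucket first.
count_by_first_char() {
    local distdir=$1
    (
        cd "$distdir" || exit 1
        for f in *; do
            # Only tally regular files; this skips subdirectories and the
            # literal '*' left behind when the directory is empty.
            [ -f "$f" ] || continue
            printf '%s\n' "${f:0:1}"
        done
    ) | LC_ALL=C sort | LC_ALL=C uniq -c | sort -rn
}
```

Calling count_by_first_char on the distfiles directory prints one
"count character" line per would-be subdirectory.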


best regards,
Michael Renner - admin of gentoo.inode.at/rsync1.at.gentoo.org

[1] http://debian.inode.at/debian/pool/main/
[2] http://cpan.inode.at/modules/by-authors/id/
[3] http://www.slackware.at/data/slackware/slackware/
--
gentoo-dev@gentoo.org mailing list
Re: Change layout of distfiles
Michael Renner wrote:
> Hi,
>
> as suggested by Mike in http://bugs.gentoo.org/show_bug.cgi?id=123335,
> here's my proposal for changing the layout of the distfiles tree:

> Introducing an additional directory hierarchy should fix this, and is
> the common solution for this problem for various projects, be it debian
> [1], cpan [2], slackware [3], etc.
>
>
> One migration scenario for a better future:
>
> Create subdirectories named after the first letter of each file and move
> the files in their respective directories.
>
> Either sym- or hardlink the files from the current distfiles
> root-directory to the specific directory where they reside in. (Check
> with the mirror admins first (depending on the chosen linktype) if rsync
> hardlink support is enabled or their web/ftp servers allow/follow symlinks)
>
> Adapt the build scripts so that they look for the files in their new
> location.
>
> Change the scripts which fetch the files for distfiles so that they save
> them under the new location.
>
> Wait a few weeks... (months? years? decades?) until the last user has
> updated and/or a clean upgrade-path exists, which doesn't rely on the
> old file locations.
>
> Drop the sym/hardlinks.
>

Is this plan for server-side distfiles only, or do you want
/usr/portage/distfiles/{a-z}/ on the local system as well? If the
latter, the answer is probably no. We've been asked in the past to
implement a DISTFILES_PREFIX type system which would work in a similar
manner, and it really only complicates things. Is there any needed
performance benefit over the current scheme? Can you give some
numbers as to how much this will help the average user?

I believe the Infrastructure team also doesn't want to change the
layout, but I'll leave it up to them to comment on their own policy ;)

> best regards,
> Michael Renner - admin of gentoo.inode.at/rsync1.at.gentoo.org
>
> [1] http://debian.inode.at/debian/pool/main/
> [2] http://cpan.inode.at/modules/by-authors/id/
> [3] http://www.slackware.at/data/slackware/slackware/
Re: Change layout of distfiles
On Mon, Mar 06, 2006 at 07:59:14AM -0500 or thereabouts, Alec Warner wrote:
> I believe the Infrastructure team also doesn't want to change the
> layout, but I'll leave it up to them to comment on their own policy ;)

We'd love to change the layout to something similar to what Michael
proposed. It's the actual changeover process that scares the bejesus out
of us. Many of our mirrors have diligent, professional admins who will
work with us to make the change. Some of our other mirrors don't. It's
easy to say, "well, screw those others, then" except when you consider that
our users will be the ones to feel the pain if a mirror doesn't pick up on
the change (or support the necessary sym/hard linking mojo to make it work).

If we can come up with a seamless, painless transition process, great,
let's make it happen.

--kurt
Re: Change layout of distfiles
Alec Warner wrote:

> Is this plan for server side only distfiles, or do you want
> /usr/portage/distfiles/{a-z}/ on the local system as well.

Changing the layout on the server suffices, no need to fiddle around
with more scripts than necessary ;).

> Is there any needed performance benefit out of the current
> scheme? Can you give some numbers as to how much this will
> help the average user?

Listing the directory via proftpd takes the better part of 10 minutes on
"cold" caches and consumes around 1 minute of CPU time on an Athlon XP
2800+. With those figures in mind, anyone who wanted to could easily DoS
a mirror server.

best regards,
Michael Renner
Re: Change layout of distfiles
Kurt Lieber wrote:

> If we can come up with a seamless, painless transition process, great,
> let's make it happen.

From the _MIRROR_ side, using hardlinks should be fine; we'd just
have to ensure that every mirror uses -H (preserve hardlinks). For
the mirrors not using -H this will just result in increased traffic and
disk usage (42GB at the moment, might hurt a bit ;) ). Ensuring that
every mirror uses -H shouldn't be a problem though (and I think they
already do, since we already did hardlink magic when moving old releases
to historical).
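
A mirror that syncs without rsync's -H (--hard-links) ends up with two
independent copies instead of one hard-linked file. A minimal sketch of
how a mirror admin might spot that, assuming bash; the function name
is_hardlinked_copy and the example paths are hypothetical:

```shell
# Hypothetical check: succeed only if the legacy flat path and the new
# bucketed path are hard links to the same file (same device and inode).
is_hardlinked_copy() {
    local legacy=$1 bucketed=$2
    # -ef follows symlinks, so rule symlinks out explicitly first.
    [ ! -L "$legacy" ] && [ ! -L "$bucketed" ] && [ "$legacy" -ef "$bucketed" ]
}
```

For example, is_hardlinked_copy distfiles/foo-1.0.tar.gz
distfiles/f/foo-1.0.tar.gz returns non-zero if the second file is a
separate copy taking up its own disk space.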

I guess the more complicated part will be adapting the ebuild system to
look for/store the files in the new location.

best regards,
Michael
Re: Change layout of distfiles
Michael Renner wrote:
> Kurt Lieber wrote:
>
>> If we can come up with a seamless, painless transition process, great,
>> let's make it happen.
>
>
> From the _MIRROR_-side using hardlinks should be fine enough, we'd just
> have to ensure that every mirror uses -H (preserve hardlinks). And for
> the mirrors not using -H this will just result in increased traffic and
> diskusage (42GB at the moment, might hurt a bit ;) ). Shouldn't be a
> problem though ensuring that every mirror uses -H (and I think they
> already do, since we already did hardlink magic when moving old releases
> to historical)
>
> I guess the more complicated part will be adapting the ebuild system to
> look for/store the files in the new location.

Taking the earlier comment ( changing files only on the mirrors ) there
are no portage changes that are technically required. However, you'd
need to change about 10000 ( random number I pulled out of my ass, but
there are many affected ) SRC_URIs to point to the new format, or
produce some sort of hack that translates between the two, and I
wouldn't be too fond of the latter effort, mostly because it would
probably rot in the tree for way too long ;)

And you need to modify the policy for placing files on the mirrors, but
that's not a portage problem either; from the portage POV the change is
relatively seamless.

>
> best regards,
> Michael
Re: Change layout of distfiles
Alec Warner wrote:
> Taking the earlier comment ( changing files only on the mirrors ) there
> are no portage changes that are technically required. However, you'd
> need to change about 10000 ( random number I pulled out of my ass, but
> there are many affected ) SRC_URI's to point to the new format, or
> produce some sort of hack that translates between the two, and I
> wouldn't be to fond of the latter effort, mostly because it would
> probably rot in the tree for way too long ;)

I don't see how making portage translate mirror://gentoo/${P}.patch.bz2
to http://distfiles.gentoo.org/distfiles/${firstchar}/${P}.patch.bz2 is
worse than changing 10000 SRC_URIs.
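
The translation described here could be sketched roughly as follows.
This is not portage code, only an illustration in bash: the function
name translate_gentoo_mirror_uri, the default base URL and the
case-sensitive first-character rule are all assumptions.

```shell
# Hypothetical translation of a mirror://gentoo/ URI into a URL under a
# first-character subdirectory of a distfiles mirror.
translate_gentoo_mirror_uri() {
    local uri=$1 base=${2:-http://distfiles.gentoo.org/distfiles}
    local prefix='mirror://gentoo/'
    case $uri in
        "$prefix"*)
            # Strip the scheme prefix, then bucket by the first character.
            local file=${uri#"$prefix"}
            printf '%s/%s/%s\n' "$base" "${file:0:1}" "$file"
            ;;
        *)
            # Anything else is passed through unchanged.
            printf '%s\n' "$uri"
            ;;
    esac
}
```

With this, no SRC_URI in the tree would need to change; only the place
where the mirror:// scheme is expanded would.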

> And you need to modify policy for placing files on the mirrors, but
> thats not a portage problem either; from the portage POV the change is
> relatively seamless.

That should be a one-time effort for one person anyway. I guess it's not
too hard to make a script that puts the stuff in
toucan:/space/distfiles-local into the right dir.

--
Kind Regards,

Simon Stelling
Gentoo/AMD64 Member
Re: Change layout of distfiles
On Monday 06 March 2006 12:36, Alec Warner wrote:
> Michael Renner wrote:
> > Kurt Lieber wrote:
> >> If we can come up with a seamless, painless transition process, great,
> >> let's make it happen.
> >
> > From the _MIRROR_-side using hardlinks should be fine enough, we'd just
> > have to ensure that every mirror uses -H (preserve hardlinks). And for
> > the mirrors not using -H this will just result in increased traffic and
> > diskusage (42GB at the moment, might hurt a bit ;) ). Shouldn't be a
> > problem though ensuring that every mirror uses -H (and I think they
> > already do, since we already did hardlink magic when moving old releases
> > to historical)
> >
> > I guess the more complicated part will be adapting the ebuild system to
> > look for/store the files in the new location.
>
> Taking the earlier comment ( changing files only on the mirrors ) there
> are no portage changes that are technically required. However, you'd
> need to change about 10000 ( random number I pulled out of my ass, but
> there are many affected ) SRC_URI's to point to the new format, or
> produce some sort of hack that translates between the two, and I
> wouldn't be to fond of the latter effort, mostly because it would
> probably rot in the tree for way too long ;)
>
> And you need to modify policy for placing files on the mirrors, but
> thats not a portage problem either; from the portage POV the change is
> relatively seamless.
>
> > best regards,
> > Michael

Hrm, /me thinks you are missing something there: almost the entire tree
doesn't explicitly state the mirror://gentoo SRC_URI; portage handles that
automatically. That being the case, portage would have to change so that the
automatic lookup was mirror://gentoo/${firstchar}/. So that is at least one
portage change I can think of being required....

Sure, I can still see your point about needing to manually change the packages
that do explicitly state mirror://gentoo in their SRC_URI, but given that you
would have to do the above anyway....

--
Daniel Ostrow
Gentoo Foundation Board of Trustees
Gentoo/{PPC,PPC64,DevRel}
dostrow@gentoo.org
Re: Change layout of distfiles
Daniel Ostrow wrote:
> Hrm, /me thinks you are missing something there, almost the entire tree
> doesn't explicitly state the mirror://gentoo SRC_URI, portage handles that
> automatically. That being the case portage would have change so that the
> automatic lookup was mirror://gentoo/${firstchar}/. So that is at least one
> portage change I can think of being required....

Huh? What does it state then? AFAIK ebuilds should ALWAYS use the
mirror:// URI when possible, and since this change is only affecting our
own mirrors, it is always possible.

> Sure I can still see your point about needing to manually change the packages
> that do explicitly state mirror://gentoo in their SRC_URI, but given that you
> would have to do the above anyway....

Huh?? My point was that we shouldn't have to change all those ebuilds
but instead just change the mirror://gentoo mapping.

--
Kind Regards,

Simon Stelling
Gentoo/AMD64 Member
Re: Change layout of distfiles
Simon Stelling wrote:

> Alec Warner wrote:
>
>> Taking the earlier comment ( changing files only on the mirrors )
>> there are no portage changes that are technically required. However,
>> you'd need to change about 10000 ( random number I pulled out of my
>> ass, but there are many affected ) SRC_URI's to point to the new
>> format, or produce some sort of hack that translates between the two,
>> and I wouldn't be to fond of the latter effort, mostly because it
>> would probably rot in the tree for way too long ;)
>
>
> I don't see how making portage translate
> mirror://gentoo/${P}.patch.bz2 to
> http://distfiles.gentoo.org/distfiles/${firstchar}/${P}.patch.bz2 is
> worse than changing 10000 SRC_URIs.

Better yet, the new portage could download files by trying both kinds of
URLs (of course, only during the transition period).
After the portage team marks the new portage version stable on all arches
and gives folks a chance to update their systems (6 months perhaps), the
infra team could make the transition to the new URLs the same way
they're doing releases -> historical transitions (namely using hardlinks).
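
As a rough illustration of trying both kinds of URLs, a fetcher could
generate the candidates in order and stop at the first one that works.
The function name candidate_urls is hypothetical, and trying the new
layout before the old one is an arbitrary choice made for this sketch.

```shell
# Hypothetical helper: print the URLs to try for one distfile during the
# transition period, one per line.
candidate_urls() {
    local base=$1 file=$2
    # New layout: first-character subdirectory.
    printf '%s/%s/%s\n' "$base" "${file:0:1}" "$file"
    # Legacy layout: flat directory.
    printf '%s/%s\n' "$base" "$file"
}
```

A caller would loop over the output and hand each URL to its usual
fetch command until one succeeds.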
Re: Change layout of distfiles
On Monday 06 March 2006 13:18, Simon Stelling wrote:
> Daniel Ostrow wrote:
> > Hrm, /me thinks you are missing something there, almost the entire tree
> > doesn't explicitly state the mirror://gentoo SRC_URI, portage handles
> > that automatically. That being the case portage would have change so that
> > the automatic lookup was mirror://gentoo/${firstchar}/. So that is at
> > least one portage change I can think of being required....
>
> Huh? What does it state then? AFAIK ebuilds should ALWAYS use the
> mirror:// URI when possible, and since this change is only affecting our
> own mirrors, it is always possible.

You seem to be missing my point. Let's pick an ebuild at random, say
app-admin/cronolog, whose SRC_URI="http://cronolog.org/download/${P}.tar.gz".
Now the automirror script will need to know to mirror it in /c/ on the
distfiles mirrors; that's outside of portage. However, when I emerge cronolog,
portage will need to know that the location of cronolog on the distfiles
mirrors is now the equivalent of mirror://gentoo/${firstchar}. Taking
distfiles.gentoo.org as an example, that would mean
http://distfiles.gentoo.org/distfiles/c/${P}.tar.gz, and that means a portage
modification in my book.

> > Sure I can still see your point about needing to manually change the
> > packages that do explicitly state mirror://gentoo in their SRC_URI, but
> > given that you would have to do the above anyway....
>
> Huh?? My point was that we shouldn't have to change all those ebuilds
> but instead just changing the mirror://gentoo-mapping.

And I was saying I agree since the same work has to be done to handle all the
automirrored stuff anyway.

--
Daniel Ostrow
Gentoo Foundation Board of Trustees
Gentoo/{PPC,PPC64,DevRel}
dostrow@gentoo.org
Re: Change layout of distfiles
Simon Stelling wrote:
> Daniel Ostrow wrote:
>
>> Hrm, /me thinks you are missing something there, almost the entire
>> tree doesn't explicitly state the mirror://gentoo SRC_URI, portage
>> handles that automatically. That being the case portage would have
>> change so that the automatic lookup was mirror://gentoo/${firstchar}/.
>> So that is at least one portage change I can think of being required....

1925 ebuilds ( with a hacked up SRC_URI checking script )[1]
URI_check.py "mirror://gentoo"
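
For readers without the script at hand, a crude stand-in for such a
count might look like the sketch below. It is not URI_check.py and may
well count differently; it assumes a grep that supports --include (GNU
and BSD grep do), and the function name is made up.

```shell
# Hypothetical stand-in: count ebuilds that mention mirror://gentoo
# anywhere in the file. The tree location defaults to /usr/portage.
count_explicit_gentoo_mirror_ebuilds() {
    local tree=${1:-/usr/portage}
    # -r recurse, -l list each matching file once, -F fixed string.
    grep -rlF --include='*.ebuild' 'mirror://gentoo' "$tree" | wc -l
}
```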

>
>
> Huh? What does it state then? AFAIK ebuilds should ALWAYS use the
> mirror:// URI when possible, and since this change is only affecting our
> own mirrors, it is always possible.
>
>> Sure I can still see your point about needing to manually change the
>> packages that do explicitly state mirror://gentoo in their SRC_URI,
>> but given that you would have to do the above anyway....
>
>
> Huh?? My point was that we shouldn't have to change all those ebuilds
> but instead just changing the mirror://gentoo-mapping.
>

See, if we do it the ebuild way we can filter via EAPI. If the ebuild has an
EAPI=2 SRC_URI but portage is only EAPI=0, then the ebuild is
automagically filtered, as opposed to the ebuild failing miserably.
It's getting close to the point where we can finally leverage EAPI to
push features out faster because backwards compatibility is maintained (
for portage ). Infra is still screwed, essentially doing 2
implementations until such time as the old one can die.

I'd prefer the mirrors not be special-cased in a mapping, since URIs
are URIs are URIs...

-Alec Warner


-----------------------------------------
[1] dev.gentoo.org/~antarus/URI_check.py

Re: Change layout of distfiles
Hi,

On 3/6/06, Michael Renner <robe@amd.co.at> wrote:
> Hi,
>
> as suggested by Mike in http://bugs.gentoo.org/show_bug.cgi?id=123335,
> here's my proposal for changing the layout of the distfiles tree:
> Introducing an additional directory hierarchy should fix this, and is
> the common solution for this problem for various projects, be it debian
> [1], cpan [2], slackware [3], etc.

Why not have the directory structure follow the package category
structure? E.g. the distfiles for package foo/bar goes into the
directory ${MIRROR_ROOT}/foo/bar?

This should be easy enough to support in Portage, and if applied to
the /usr/portage/distfiles directory too, would solve a few other
problems. It also has the advantage of grouping the distfiles in a
way that users would find natural to browse.

There is the problem of what happens when a package moves, but I think
that's easily solved too.

Best regards,
Stu

Re: Change layout of distfiles
Stuart Herbert wrote:

>Why not have the directory structure follow the package category
>structure? E.g. the distfiles for package foo/bar goes into the
>directory ${MIRROR_ROOT}/foo/bar?
>
>This should be easy enough to support in Portage, and if applied to
>the /usr/portage/distfiles directory too, would solve a few other
>problems. It also has the advantage of grouping the distfiles in a
>way that users would find natural to browse.
>
>There is the problem of what happens when a package moves, but I think
>that's easily solved too.
>
>
This has been discussed before.
Summary: tarballs could be used by more than one package, so this way
you'd manage to increase the disk space demands for our mirrors.
Re: Change layout of distfiles
Alin Nastac wrote:
> this has been discussed before.
> summary: tarballs could be used by more than one package. this way
> you'll manage to increase the disk space demands for our mirrors.

This one is about sorting by first letter of filename. It won't solve
multiple different files with same filename, though.

Cheers,
-jkt

--
cd /local/pub && more beer > /dev/mouth
Re: Change layout of distfiles
On 3/6/06, Alin Nastac <mrness@gentoo.org> wrote:
> this has been discussed before.
> summary: tarballs could be used by more than one package. this way
> you'll manage to increase the disk space demands for our mirrors.

And you can't hard-link the files into multiple directories because ...?

Best regards,
Stu

Re: Change layout of distfiles
Michael Renner wrote:

> Introducing an additional directory hierarchy should fix this, and is
> the common solution for this problem for various projects, be it debian
> [1], cpan [2], slackware [3], etc.
>
>
> One migration scenario for a better future:
>
> Create subdirectories named after the first letter of each file and move
> the files in their respective directories.
>

Splitting the files using only one letter still leaves some directories
with too many files, IMHO.

g 2879
l 2394
p 2049
s 2018

versus

l li 1652
k kd 888
x xf 670
g gn 559

li* (lib) is still a lot, but more manageable.

The total number of files in my mirror directory is 32000, but I don't
delete old files, and I started some months ago.
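
A minimal sketch of how a file name could map to such a two-level
bucket, assuming the layout is meant as first letter, then first two
letters (e.g. l/li/). That reading of the table, and the function name
two_level_bucket, are assumptions; bash is required for the substring
expansion.

```shell
# Hypothetical mapping from a distfile name to a two-level bucket path.
two_level_bucket() {
    local file=$1
    # First character, then the first two characters. A one-character
    # name simply yields the same character twice.
    printf '%s/%s\n' "${file:0:1}" "${file:0:2}"
}
```

For instance, a file named libfoo-1.0.tar.bz2 would land in l/li.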
Re: Change layout of distfiles
On Mon, 6 Mar 2006 19:54:28 +0000 "Stuart Herbert"
<stuart.herbert@gmail.com> wrote:
| On 3/6/06, Alin Nastac <mrness@gentoo.org> wrote:
| > this has been discussed before.
| > summary: tarballs could be used by more than one package. this way
| > you'll manage to increase the disk space demands for our mirrors.
|
| And you can't hard-link the files into multiple directories
| because ...?

...you have to find them first, and because there's a hard link limit
on some filesystems, and because some filesystems don't do hardlinks.

--
Ciaran McCreesh : Gentoo Developer (Wearer of the shiny hat)
Mail : ciaranm at gentoo.org
Web : http://dev.gentoo.org/~ciaranm
Re: Change layout of distfiles
Jan Kundrát wrote:

>Alin Nastac wrote:
>
>
>>this has been discussed before.
>>summary: tarballs could be used by more than one package. this way
>>you'll manage to increase the disk space demands for our mirrors.
>>
>>
>
>This one is about sorting by first letter of filename. It won't solve
>multiple different files with same filename, though.
>
>
>
I know what this is about, but Stuart was trying to reopen that old thread.

You can't solve the name conflict in a generic fashion without
increasing the resources required from our mirrors (either disk space or
CPU + RAM).
Since the probability of such a conflict is very low, I say we'd better
solve one conflict at a time, by hosting a renamed version of those files
on mirror://gentoo.
Re: Change layout of distfiles
On Mon, Mar 06, 2006 at 12:36:22PM -0500, Alec Warner <antarus@gentoo.org> wrote:
> Taking the earlier comment ( changing files only on the mirrors ) there
> are no portage changes that are technically required. However, you'd
> need to change about 10000 ( random number I pulled out of my ass, but
> there are many affected ) SRC_URI's to point to the new format, or
> produce some sort of hack that translates between the two, and I
> wouldn't be to fond of the latter effort, mostly because it would
> probably rot in the tree for way too long ;)

For the time being, what's stopping us from doing something like the
following on the mirrors?

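# Run from the distfiles root; assumes bash. The glob is expanded once,
# up front, so links created by the loop are not visited again.
for i in *; do
    # Skip directories, symlinks and anything else that is not a plain file.
    [ -f "${i}" ] && [ ! -L "${i}" ] || continue
    # I guess we REALLY want case sensitivity, but that's not for me
    # to decide.
    dir=$(printf '%s' "${i:0:1}" | tr '[:upper:]' '[:lower:]')
    mkdir -p -- "${dir}"
    mv -- "${i}" "${dir}/"
    # Hard-link the file back to its old location.
    ln -- "${dir}/${i}" "${i}"
done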

> And you need to modify policy for placing files on the mirrors, but
> thats not a portage problem either; from the portage POV the change is
> relatively seamless.

Modifying the mirror code to do something like the above shouldn't be
complicated at all.

--
Role: Gentoo Linux Kernel Lead
Gentoo Linux: http://www.gentoo.org
Public Key: gpg --recv-keys 9C745515
Key fingerprint: A0AF F3C8 D699 A05A EC5C 24F7 95AA 241D 9C74 5515
Re: Change layout of distfiles
On Mon, 6 Mar 2006 13:45:01 -0500,
Daniel Ostrow <dostrow@gentoo.org> wrote:

> portage will need to know that the location, on the distfiles
> mirrors, of cronolog, is now the equivilent of
> mirror://gentoo/${firstchar}

And what about the "local" mirror type that one can define in
/etc/portage/mirrors: will it be assumed that files are stored with a
first-char prefix or not?
I would say no, because I think the most common usage is to share the
$DISTDIR of one machine (hence without prefix) over the LAN, but I'm not
really sure. Maybe some people also use this one for full mirrors
based on the official ones (hence with prefix).

--
TGL.
Re: Change layout of distfiles
On Mon, 6 Mar 2006 16:02:06 +0000
Kurt Lieber <klieber@gentoo.org> wrote:

> On Mon, Mar 06, 2006 at 07:59:14AM -0500 or thereabouts, Alec Warner
> wrote:
> > I believe the Infrastructure team also doesn't want to change the
> > layout, but I'll leave it up to them to comment on their own
> > policy ;)
>
> We'd love to change the layout to something similar to what Michael
> proposed. It's the actual changeover process that scares the bejesus
> out of us. Many of our mirrors have diligent, professional admins
> who will work with us to make the change. Some of our other mirrors
> don't. It's easy to say, "well, screw those others, then" except
> when you consider that our users will be the ones to feel the pain if
> a mirror doesn't pick up on the change (or support the necessary
> sym/hard linking mojo to make it work)
>
> If we can come up with a seamless, painless transition process, great,
> let's make it happen.

As long as the mirrors provide the files at the current address (whether
via hardlinks, redirects or rewrite magic doesn't matter) I don't see an
issue from the portage side. Of course, if the mirror script uses
portage we have to take a look at that, but such a change would be an
internal infra thing more or less.
I'm very opposed, however, to any client-side changes, not only for
transitioning issues but also because this would require special-casing
code for gentoo mirrors, both in mirror:// and GENTOO_MIRRORS handling,
and as TGL has mentioned there are some cases where we can't be sure if
we have a mirror with this new or a "traditional" structure. If done at
all, this would need a generic solution, something that's probably a bit
too complex for a (minor?) performance issue.

Marius

--
Public Key at http://www.genone.de/info/gpg-key.pub

In the beginning, there was nothing. And God said, 'Let there be
Light.' And there was still nothing, but you could see a bit better.