Mailing List Archive

@system and parallel merge speedup
Hi!

On Sun, Oct 21, 2012 at 08:02:47AM +0000, Duncan wrote:
> Bottom line, an empty @system set really does make a noticeable
> difference in parallel merge handling, speeding up especially --emptytree
> @world rebuilds but also any general update that has a significant number
> of otherwise @system packages and deps, dramatically. I'm happy. =:^)

I think the "@system first" and "@system not merged in parallel" rules are
safe to break when you're just doing "--emptytree @world" on an
already-updated OS, because that only rebuilds existing packages, and all
packages will see the same set of other packages (including the same
versions) while compiling. But when upgrading multiple packages (including
some from the original @system and some from @world), this may well result
in bugs.

As for the "--emptytree @world" speedup, can you provide benchmark numbers?
I mean, only a few packages are forced to use a single CPU core while
compiling. So merging packages in parallel mostly saves time on
unpack/prepare/configure/install/merge. All of those except configure
actually do a lot of I/O, which will most likely lose speed rather than
gain it when done in parallel (especially keeping in mind kernel bug 12309).
So, at a glance, the time you win on configure you'll mostly lose on I/O;
most of the time all your CPU cores will be loaded anyway while compiling,
and doing configure in parallel with compiling is unlikely to save much
time. That's why I think that without actual benchmarks we can't be sure
how much faster it became (if it became faster at all, which is
questionable).

As for me, I found a very effective way to speed up emerge: upgrading from
a Core2Duo E6600 to an i7-2600K overclocked to 4.6GHz. This sped up
compilation on my system by a factor of six (the kernel now compiles in
just 1 minute). And to speed up most other (non-compilation) portage
operations I use a 4GB tmpfs mounted on /var/tmp/portage/.

--
WBR, Alex.
Re: @system and parallel merge speedup
Alex Efros posted on Sun, 21 Oct 2012 16:24:32 +0300 as excerpted:

> Hi!
>
> On Sun, Oct 21, 2012 at 08:02:47AM +0000, Duncan wrote:
>> Bottom line, an empty @system set really does make a noticeable
>> difference in parallel merge handling, speeding up especially
>> --emptytree @world rebuilds but also any general update that has a
>> significant number of otherwise @system packages and deps,
>> dramatically. I'm happy. =:^)
>
> I think the "@system first" and "@system not merged in parallel" rules
> are safe to break when you're just doing "--emptytree @world" on an
> already-updated OS, because that only rebuilds existing packages, and
> all packages will see the same set of other packages (including the same
> versions) while compiling. But when upgrading multiple packages
> (including some from the original @system and some from @world), this
> may well result in bugs.

In theory, you're right. In practice, I've not seen it yet, tho being
cautious I'd say it needs at least six months of testing (I've only been
testing it about a month, maybe six weeks) before I can say for sure.
It /was/ something I was a bit concerned about, however.

That was in fact one of the reasons I decided to try it on the netbook's
chroot as well, which hadn't been upgraded in a year and a half. I
figured if it could work reasonably well there, the chances of an
undiscovered real problem were much lower.

However, it /is/ worth noting that as a matter of course, I already often
choose to do some system-critical upgrades (portage, gcc, glibc, openrc,
udev) on their own, before doing the general upgrades, in part so I can
deal with their config file changes and note any problems right away,
with a relatively small changeset to deal with, as opposed to having a
whole slew of updates including critical system package updates happen
all at once, thus making it far more difficult to trace which update
actually broke things.

That's where the years of gentoo experience I originally mentioned come
in. This isn't going to be as easy for a gentoo newbie, for at least two
reasons. First, they're less likely to know what packages really /are/
system critical, and thus are more likely to unmerge them without the
extra unmerge warning a package in the system set gets. (I mentioned
that one in the first post.) Second, spotting critical updates in the
initial --pretend run, knowing which packages it's a good idea to upgrade
first, by themselves, dealing with config file updates, etc, for just
that critical package (and any dependency updates it might pull in),
before going on to the general @world upgrade, probably makes a good bit
of difference in practice, and gentoo newbies are rather less likely to
be able to make that differentiation. (I didn't specifically mention
that one until now.)

> As for the "--emptytree @world" speedup, can you provide benchmark
> numbers? I mean, only a few packages are forced to use a single CPU core
> while compiling. So merging packages in parallel mostly saves time on
> unpack/prepare/configure/install/merge. All of those except configure
> actually do a lot of I/O, which will most likely lose speed rather than
> gain it when done in parallel (especially keeping in mind kernel bug
> 12309). So, at a glance, the time you win on configure you'll mostly
> lose on I/O; most of the time all your CPU cores will be loaded anyway
> while compiling, and doing configure in parallel with compiling is
> unlikely to save much time. That's why I think that without actual
> benchmarks we can't be sure how much faster it became (if it became
> faster at all, which is questionable).

Good points, and no, I can't easily provide benchmarks, both because of
the recent hardware upgrade here, and because portage itself has been
gradually improving its parallel merging abilities -- a recent update
changed the scheduling algorithm so it starts additional merges much
sooner than it did previously. (See gentoo bug 438650 fixed in portage
2.1.11.29 and 2.2.0_alpha140, both released on Oct 17. The fact that I
know about that hints at another thing I do routinely as an experienced
gentooer: I
always read portage's changelog and check out any referenced bugs that
look interesting, before I upgrade portage. To the extent practical
without actually reading the individual git commits, I want to know about
package manager changes that might affect me BEFORE I do that upgrade!)

But, I believe as core-counts rise, you're underestimating the effects of
portage's parallel merging abilities. In particular, a lot of packages
normally in @system (or deps thereof) are relatively small packages such
as grep, patch, sed... where the single-threaded configure step takes a
MUCH larger share of the total package merge time than it does with
larger packages. Similarly, the unpack and prepare phases, plus the
package phase for folks using FEATURES=buildpkg, tend to be
single-threaded.[1]

Thus, instead of serializing several dozen small mostly single-threaded
package merges for packages like grep/sed/patch/util-linux/etc, depending
on the --jobs and --load-average numbers you feed to portage, several of
these end up getting done in parallel, with the portage multi-job output
bumping a line every few seconds because it's doing them in parallel,
instead of every minute or so, because it's doing one at a time.
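FWIW, the knobs involved are just a couple of make.conf lines. A sketch,
with example values only (not recommendations; scale --jobs and -j to your
own core count and RAM):

```shell
# /etc/portage/make.conf -- example numbers, adjust for your hardware.
# --jobs caps how many package merges portage runs concurrently;
# --load-average keeps it from starting another merge once the
# one-minute loadavg tops the given value.
EMERGE_DEFAULT_OPTS="--jobs=4 --load-average=8"

# Per-package make parallelism, with its own load cap:
MAKEOPTS="-j8 -l8"
```

The load-average caps are what let the small configure-bound packages
overlap without swamping the box when several compiles peak at once.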

Meanwhile, it should be obvious, but it's worth stating anyway: the
effect gets *MUCH* bigger as the number of cores increases. For a dual-
core, bah, not worth the trouble, as it could cause more problems than it
solves, especially if people are trying to work on other things while
portage is doing its thing in the background. I suspect the break-over
point is either triple-core or quad-core. One of the reasons portage is
getting better lately is that someone with a 32-core machine, and a
corresponding amount of memory (64 or 128 gig IIRC), has taken an
interest.

It's worth noting, as I mentioned, that I now have a 6-core, recently
upgraded from a dual-dual-core (4 cores), with a corresponding memory
upgrade, to 16 gigs.

One of the first things I noticed doing emerges was how much more
difficult it was to keep the 6-core actually peaked out to 100% CPU than
it had been with the 4-core. While I suspect there would have been a
difference on the quad-core (as I said I believe the break-over's
probably 3-4 cores), it wasn't a big deal there. Staring at that 6-core
running at 100% on 1-2 cores CPU-freq-maxed at 3.6 GHz, while the other
4-5 cores remained near idle at <20% utilization at CPU-freq-minimum 1.4
GHz... was VERY frustrating. So began my drive to empty @system and get
portage properly scheduling parallel merges for former @system packages
and their deps as well!

For the quad-core plus hyperthreading (thus 8 threads, I take it?) you
mention below (4.6 GHz OC, nice! I see stock is 3.4 GHz), the boost from
killing @system's forced serialization should definitely make a difference
(unless the hyperthreading doesn't do much for that workload, making it
effectively no better than a non-hyperthreaded quad-core). For my 6-core,
it made a rather big difference, and I guarantee that if you had the
32-core that one of the devs working on improving portage's
parallelization has, you'd be hot on the trail to improve it as well!

> As for me, I found a very effective way to speed up emerge: upgrading
> from a Core2Duo E6600 to an i7-2600K overclocked to 4.6GHz. This sped
> up compilation on my system by a factor of six (the kernel now compiles
> in just 1 minute). And to speed up most other (non-compilation) portage
> operations I use a 4GB tmpfs mounted on /var/tmp/portage/.

I remember reading about the 1-minute kernel compiles on i7s. Very
impressive.

FWIW, there are a lot of variables to fill in before we can be sure
kernel build time comparisons are apples to apples (I had several more
paragraphs written on that, but decided it was a digression too far for
this post, so deleted 'em), but AFAIK when I read about it (on phoronix,
I believe), he was doing an all-yes config, so building rather more than
a typical customized-config gentooer, but was using a rather fast SSD,
which probably improved his times quite a bit compared to "spinning rust".

But I don't know if his timings included the actual compress (and if so
with what CONFIG_KERNEL_XXX compression option) and I don't believe they
included the actual install, only the build.

That said, a 1-minute all-yes-config kernel build time is impressive
indeed, the envy of many, including me. (OTOH, my fx6100 was on sale for
$100, $109 post-tax. That's lower than pricewatch's $118 lowest quote
(shipped, no tax), and only about 40% of the $273 low quote for an
i7-2600k.)

My build, compress (CONFIG_KERNEL_XZ) and install, runs ~2 minutes
(1:58-2:07, 10+ runs, warm-cache), so yes, even if your build time
doesn't include compress and install, which it might, 1-minute is still
VERY impressive. Tho as I said, my CPU cost ~40% of the going price on
yours, so...

Meanwhile...

I too use and DEFINITELY recommend a tmpfs $PORTAGE_TMPDIR. I'm running
16 gig RAM here, and didn't want to run out of room with parallel builds,
so set a nice roomy 12G tmpfs size.
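For anyone wanting to try it, it's a single fstab line. A sketch (the 12G
matches my setting here; size it to your own RAM):

```shell
# /etc/fstab -- RAM-backed build dir; builds that outgrow it spill
# into swap rather than failing outright.
tmpfs   /var/tmp/portage   tmpfs   size=12G,mode=0775   0 0
```

With PORTAGE_TMPDIR left at its /var/tmp default, `mount
/var/tmp/portage` activates it without a reboot.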

A $PORTAGE_TMPDIR on tmpfs also reduces the I/O. At least here, the only
time I've had problems, both on the old hardware and on the new, is when
I go into swap. (And on the old hardware I had swap priority= striped
across four disks and 4-way md/raid0, so the kernel could schedule swap-
out vs read-in much better and I didn't see a problem until I hit nearly
half-gig of swap loading at once; the new hardware is only single-disk
ATM, and I see issues starting @ 80 meg or so of swap loading, at once.)
But with 16 gig RAM on the new system, the only time I see it go into
swap is when I run a kernel build with uncapped -j, thus hitting 500+
jobs and close enough to 16 gigs that whether I hit swap or not depends
on what else I've been doing with the system.
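The striped-swap setup mentioned above is, for reference, just equal pri=
values on each device; an fstab sketch with hypothetical partition names:

```shell
# /etc/fstab -- equal priorities make the kernel round-robin swap
# pages across all four devices, raid0-style, instead of filling
# them one at a time:
/dev/sda2   none   swap   sw,pri=1   0 0
/dev/sdb2   none   swap   sw,pri=1   0 0
/dev/sdc2   none   swap   sw,pri=1   0 0
/dev/sdd2   none   swap   sw,pri=1   0 0
```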

Basically, I/O is thus not a problem at all with portage, here, up to the
--jobs=12 --load-average=12 along with MAKEOPTS="-j20 -l15" I normally
run, anyway. On the old system with only six gigs of RAM, if I tried
hard enough I could get portage to hit swap there, but I limited --jobs
and MAKEOPTS until that wasn't an issue, and had no additional problems.

Tho I should mention I also run PORTAGE_NICENESS=19 (and my kernel-build/
install script similarly renices itself to 19 before starting the kernel
build), which puts it in batch-scheduling mode (idle-only scheduling, but
longer timeslices).
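That's one line in make.conf; a sketch, with an explicit SCHED_BATCH
invocation (util-linux chrt) shown commented for one-off builds run
outside portage:

```shell
# /etc/portage/make.conf -- emerge renices itself to 19 on startup,
# keeping the desktop responsive during big merges:
PORTAGE_NICENESS="19"

# One-off equivalent for a manual kernel build (SCHED_BATCH policy,
# longer timeslices, via util-linux chrt):
# nice -n 19 chrt --batch 0 make -j8
```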

If it matters, filesystem is reiserfs, iosched is cfq, drive is sata2/ahci
(amd 990fx/sb950 chipset) 2.5" seagate "spinning rust".

But I definitely agree with $PORTAGE_TMPDIR on tmpfs. It makes a HUGE
difference!

---
[1] Compression parallelism: There are parallel-threaded alternatives to
bzip2, for instance, but they have certain down-sides like decompress
only being parallel where the tarball was compressed with the same
parallel tool, and certain compression buffer nul-fill handling
differences that make them not functionally perfect drop-in replacements.
See the recent discussion on the topic on the gentoo-dev list for
instance.
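For the curious, the drop-in idea is just swapping the compressor in a
pipeline, e.g. "tar -cf - pkg/ | pbzip2 > pkg.tar.bz2" (pbzip2 and lbzip2
being the tools discussed). Round-trip shown with stock bzip2 so the
example runs anywhere:

```shell
# The drop-in pattern: substitute pbzip2 or lbzip2 for bzip2 in the
# pipeline. Round-trip demonstrated with plain bzip2:
printf 'portage\n' | bzip2 | bunzip2
```

Note the caveat above, though: a stream compressed by plain bzip2
generally still decompresses single-threaded even when fed to the
parallel tools.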

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman