Mailing List Archive

Compressing the old table
The storage of old article revisions on Wikipedia is taking a few gigs
now, and getting larger.

Occasional alterations made to the old table's structure require a
duplicate temporary table; the downside of InnoDB's storage management
is that while its storage pool can grow automatically, it doesn't
shrink! This leaves us with a few gigs of wasted space on the hard drive
that aren't available for other resources -- log files, uploads, etc. --
which might be needed. Frankly, we're running a little tight right now.

As the table grows, the doubled space requirement at maintenance time
will grow along with it. And even if we switch back to a database
backend that allows reclaiming unused space, we need to keep that much
space free for when it comes time to add a field or adjust the indexes.


We could store old revisions as binary blobs compressed with gzip
instead of raw text; this should reduce overall storage requirements by
well over 50%. Performance shouldn't be harmed significantly, as old
versions are loaded relatively rarely and saved only once each.
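
A rough sketch of what this could look like in PHP, assuming zlib is
compiled in (the function names are just for illustration, not existing
code):

  # Sketch only -- gzdeflate()/gzinflate() come with PHP's zlib support.
  function compressOldText( $text ) {
      # Level 9 trades a little extra CPU for the best ratio; old
      # revisions are written once and read rarely, so that seems fine.
      return gzdeflate( $text, 9 );
  }

  function uncompressOldText( $blob ) {
      return gzinflate( $blob );
  }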

Pro:
* Saves disk space!
* Should reduce maintenance downtime on table alterations, with less
data to copy around.

Con:
* Old revision contents wouldn't be searchable with straight SQL queries
(LIKE etc.).
* Slight speed degradation due to compression/decompression; it also
makes the code a little more complex.
* PHP needs to be recompiled with zlib support; we may have to make it
optional for third-party sites (see the sketch below).
* The downloadable dumps may actually get bigger, depending on how the
per-revision compression interacts with the dump's own compression.
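
On the zlib point above: a rough sketch of keeping it optional would be
to test for the functions at runtime and record which encoding was used
per row -- the old_flags column named here is only an assumption, not a
settled schema change:

  # Sketch only: fall back to plain text when zlib isn't compiled in,
  # and note which encoding was used (e.g. in a hypothetical old_flags
  # column) so the reader knows whether to gzinflate().
  function encodeOldText( $text, &$flags ) {
      if ( function_exists( 'gzdeflate' ) ) {
          $flags = 'gzip';
          return gzdeflate( $text );
      }
      $flags = '';
      return $text;
  }

  function decodeOldText( $blob, $flags ) {
      return ( $flags == 'gzip' ) ? gzinflate( $blob ) : $blob;
  }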

-- brion vibber (brion @ pobox.com)
Re: Compressing the old table
> The storage of old article revisions on Wikipedia is taking a few gigs
> now, and getting larger.

About how much are we talking here? Space is cheap.

> We could store old revisions as binary blobs compressed with gzip
> instead of raw text

If we do this, we should leave the last n revisions uncompressed, so that
recent changes are still quickly available for diffs. I suggest n=5.
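
Picking the rows to compress could be as simple as the following sketch;
the column names (old_id, old_title, old_timestamp) and the
compressOldRow() helper are only guesses for illustration, not the real
schema or code:

  # Sketch only: compress everything except the newest $n revisions
  # of one title.
  $n = 5;
  $res = mysql_query( "SELECT old_id FROM old WHERE old_title='" .
      addslashes( $title ) . "' ORDER BY old_timestamp DESC" );
  $seen = 0;
  while ( $row = mysql_fetch_object( $res ) ) {
      if ( ++$seen > $n ) {
          compressOldRow( $row->old_id );
      }
  }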

But yes, I was afraid we would run into this problem sooner or later,
ever since I first realized that we were, in fact, storing every version.
Some 30k articles are edited dozens of times in edit wars. We could, of
course, also use diffs+snapshots every x revisions.
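
The diffs+snapshots idea might look roughly like this, assuming
something like the PECL xdiff extension is available for the actual
diffing (which would be one more dependency to argue about):

  # Sketch only: keep a full snapshot every $x revisions and a text
  # diff otherwise; rebuilding a revision means starting from the last
  # snapshot and patching forward.
  function encodeRevision( $revNo, $prevText, $newText, $x = 10 ) {
      if ( $revNo % $x == 0 || $prevText == '' ) {
          return array( 'type' => 'full', 'data' => $newText );
      }
      return array( 'type' => 'diff',
                    'data' => xdiff_string_diff( $prevText, $newText ) );
  }

  function applyRevision( $baseText, $rev ) {
      if ( $rev['type'] == 'full' ) {
          return $rev['data'];
      }
      return xdiff_string_patch( $baseText, $rev['data'] );
  }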

I'm also thinking of a good way to implement my "merge subsequent edits by
one user" proposal to drastically reduce the number of edits.

Regards,

Erik
Re: Compressing the old table
This should be less of an issue when we separate the webserving and
the database serving...

Jason

Brion Vibber wrote:

> The storage of old article revisions on Wikipedia is taking a few gigs
> now, and getting larger.

<SNIP>



--
"Jason C. Richey" <jasonr@bomis.com>
Re: Compressing the old table
Brion Vibber wrote:
> The storage of old article revisions on Wikipedia is taking a few gigs
> now, and getting larger.
>
> Occasional alterations made to the old table's structure require a
> duplicate temporary table; the downside of InnoDB's storage management

<SNIP>

> * PHP needs to be recompiled with zlib support; we may have to make it
> optional for third-party sites.
>
> -- brion vibber (brion @ pobox.com)

If this were not optional, it would be a serious blow to some other
Wikipedia-based sites (not that there are many). There are a lot of
webhosting companies that will not want to change their setups just so a
few customers can continue to run the latest version of PediaWiki.

Tobin Richard