The storage of old article revisions on Wikipedia is taking a few gigs
now, and getting larger.
Occasional alterations to the old table's structure require a
duplicate temporary table; the downside of InnoDB's storage management
is that while its storage pool can grow automatically, it doesn't
shrink! This leaves us with a few gigs of wasted space on the hard drive
that aren't available for other resources -- log files, uploads, etc. --
which might be needed. Frankly, we're running a little tight right now.
As the table grows, the doubled space requirement during maintenance
will grow with it. And even if we switch back to a database backend
that allows reclaiming unused space, we need to keep that space
available for when it comes time to add a field or adjust the indexes.
We could store old revisions as binary blobs compressed with gzip
instead of raw text; this should reduce overall storage requirements by
well over 50%. Performance shouldn't be harmed significantly, as old
versions are loaded relatively rarely and saved only once each.
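For illustration, here's a minimal sketch of the round-trip in Python
(the actual implementation would use PHP's zlib functions such as
gzcompress/gzuncompress; the function names below are hypothetical):

```python
import zlib

def compress_revision(text: str) -> bytes:
    # Compress old revision text into a binary blob for storage.
    # Level 9 favors space savings over speed, which suits rarely-read data.
    return zlib.compress(text.encode("utf-8"), 9)

def decompress_revision(blob: bytes) -> str:
    # Recover the original wikitext when an old version is requested.
    return zlib.decompress(blob).decode("utf-8")

# Wikitext is highly redundant, so compression should do well on it.
article = "== Heading ==\n" + "Wiki markup tends to compress well. " * 50
blob = compress_revision(article)
assert decompress_revision(blob) == article
ratio = len(blob) / len(article.encode("utf-8"))
```

With real article text the savings won't be as dramatic as with this
repetitive sample, but well over 50% is plausible for typical wikitext.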
Pro:
* Saves disk space!
* Should reduce maintenance downtime on table alterations, with less
data to copy around.
Con:
* Old revision contents wouldn't be searchable with straight SQL queries
(LIKE etc).
* Slight speed degradation due to compression/decompression; makes the
code a little more complex.
* PHP needs to be recompiled with zlib support; we may have to make it
optional for third-party sites.
* The downloadable dumps may actually get bigger, depending on how
compressors interact.
-- brion vibber (brion @ pobox.com)