Mailing List Archive

Conserving Storage Space and Removing History
In a message dated 6/21/2004 5:02:01 PM Eastern Standard Time,
brion@pobox.com writes:

> Note that currently we don't have diff-based storage; when you make a
> change to a page the entire previous revision is stored in whole.
> (Consider enabling $wgCompressOld if you have zlib support in PHP; this
> will reduce old text requirements by roughly half.)
>
> -- brion vibber (brion @ pobox.com)
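
Brion's two points — a full copy of the text is stored per revision, and zlib roughly halves it — can be sanity-checked in a few lines of Python. The page text below is invented, and because it is repetitive it compresses far better than the "roughly half" typical of real wiki prose:

```python
import zlib

# Hypothetical page text; without diff-based storage, every revision
# stores a complete copy like this.
revision = ("== History ==\nThe project began as a small experiment.\n" * 40).encode("utf-8")

compressed = zlib.compress(revision)

# Compression is lossless: decompressing returns the original bytes.
assert zlib.decompress(compressed) == revision

print(f"raw: {len(revision)} bytes, compressed: {len(compressed)} bytes")
```
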

Currently, my group's wiki is small. There are a few of us actively
contributing right now, but that will probably change soon. The handful of volunteers
who have been putting in content have also been learning the way of the wiki
as they go, making multiple edits on some rather lengthy articles and
innocently eating up storage space.

At the moment, our wiki is restricted so that only registered users can
contribute, and only the sysop can create a registered user account.

We had been trying to research the wiki's overhead requirements in order to
judge whether or not to buy more disk space from our provider. During that
investigation, we used the 'wikipedia' statistics and charts on space. Given the
590MB figure from May 22, 2004 and the high number of articles the db had, it
never occurred to us that 'wikipedia' was storing full copies of all versions
of every article. We must have been reading the wrong statistics.

Do the 'wikipedia' administrators remove history from their wiki in order to
preserve space? If so, how is this done? Is there some sort of 'export only
the latest version of each article, etc.' option, clear the db, and then import
the latest versions back?

Our administrator has set "$wgCompressRevisions = true;" since your
message (above) -- will that take care of only the revisions made since the flag was
turned on, or will the previous revisions be compressed as well?

I appreciate everyone's patience in this. I'm sort of the go-between right
now. Hopefully our administrator will come online with this list and she can
pose the questions more 'technically'. :)

Our versions:
MediaWiki: 1.3.0beta2
PHP: 4.3.4 (apache)
MySQL: 4.0.18

Take care,

Debi
Re: Conserving Storage Space and Removing History [ In reply to ]
AlphabetDP@aol.com wrote:
> In a message dated 6/21/2004 5:02:01 PM Eastern Standard Time,
> brion@pobox.com writes:
>
>>Note that currently we don't have diff-based storage; when you make a
>>change to a page the entire previous revision is stored in whole.
>>(Consider enabling $wgCompressOld if you have zlib support in PHP; this
>>will reduce old text requirements by roughly half.)
>>
>>-- brion vibber (brion @ pobox.com)
>
<snip>
> We had attempted to research the wiki's overhead requirements in making a
> judgment as to whether or not to buy more disk space from our provider. During
> the investigation of overhead storage requirements, we used the 'wikipedia'
> statistics and charts on space. It never occurred to us that 'wikipedia' was
> storing full copies of all versions of an article based on the 590MB May 22, 2004
> number and considering the high number of articles the db had. We must have
> been reading the wrong statistics.

You might have looked at the cur dump, which only holds the latest
revision of each page, not the old revisions. Compressed, the SQL dump sizes
for the English wikipedia are:

cur : 269 MB
old : 7608 MB

The sizes of all wikipedias databases are available at:
http://www.wikipedia.org/wikistats/EN/TablesDatabaseSize.htm

In fact the databases themselves are bigger than the compressed dumps :o)

> Do the 'wikipedia' administrators remove history from their wiki in order to
> preserve space? If so, how is this done? Is there some sort of 'export only
> the latest version of each article, etc.' option, clear the db, and then import
> the latest version back?

There is no such option. One could drop older entries from the "old"
table, but you would then lose the page histories. The only things deleted
from the wikipedia databases are new articles which are vandalism / incorrect
data. They are dropped from the "cur" table but are still in "old" (as
far as I know).

> Our administrator has set the "$wgCompressRevisions = true;" since your
> message (above) -- will that take care of only the revisions since the flag was
> turned on or will there be compression of the previous revisions as well?

I think it will only apply to revisions made after the flag got set; I am
not sure whether there is a ./maintenance/ script to compress revisions made
before the switch.


Hopefully the new diff-based history will save a lot of space.

--
Ashar Voultoiz
Re: Conserving Storage Space and Removing History [ In reply to ]
AlphabetDP@aol.com wrote:
> It never occurred
> to us that 'wikipedia' was storing full copies of all versions of an
> article based on the 590MB May 22, 2004 number and considering the high
> number of articles the db had. We must have been reading the wrong
> statistics.

That sounds like the table of current revisions. The old revisions table
for en.wikipedia.org is over 10GB.

> Do the 'wikipedia' administrators remove history from their wiki in
> order to preserve space? If so, how is this done? Is there some sort of
> 'export only the latest version of each article, etc.' option, clear
> the db, and then import the latest version back?

No.

> Our administrator has set the "$wgCompressRevisions = true;" since your
> message (above) -- will that take care of only the revisions since the
> flag was turned on or will there be compression of the previous
> revisions as well?

That only affects newly saved revisions. There is a 'compressOld.php'
script in the maintenance directory which will go through and get the rest.

-- brion vibber (brion @ pobox.com)
Re: Conserving Storage Space and Removing History [ In reply to ]
AlphabetDP@aol.com wrote:
<snip>
> Do the 'wikipedia' administrators remove history from their wiki in order to
> preserve space? If so, how is this done?
<snip>
They can't remove the old versions, since that would violate the GFDL. I think.

Either way, I could write a patch sometime to implement diff-based storage, with a feature to store the whole text every n edits ('keyedits') to prevent corruption. Of course, that will be after I actually figure out how to hack MediaWiki, test out the patch, figure out how to log on to CVS, get access to the CVS server, upload it, and hope that the next time the test wiki is re-init'd with the latest edition it doesn't crash or give some horrible error. And by the time I do half the patch I'll probably get bored and leave it on my hard drive for 2 months, find it, and realise I have to write it all over again to fit a new version. Which means I'll never actually do it.
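
For what it's worth, the keyedit scheme described above — a full snapshot every n revisions, diffs in between — can be sketched as a toy Python class. This has nothing to do with actual MediaWiki internals; the class and method names are made up for illustration:

```python
import difflib

class RevisionStore:
    """Toy diff-based page history: a full snapshot every `key_interval`
    revisions ('keyedits'), SequenceMatcher opcodes in between."""

    def __init__(self, key_interval=10):
        self.key_interval = key_interval
        self.revisions = []  # entries are ("full", text) or ("diff", ops)

    def save(self, text):
        if len(self.revisions) % self.key_interval == 0:
            self.revisions.append(("full", text))
        else:
            prev = self.get(len(self.revisions) - 1)
            sm = difflib.SequenceMatcher(a=prev, b=text)
            # Keep payload only for changed spans; 'equal' spans are
            # copied from the previous revision at replay time.
            ops = [
                (tag, i1, i2, None if tag == "equal" else text[j1:j2])
                for tag, i1, i2, j1, j2 in sm.get_opcodes()
            ]
            self.revisions.append(("diff", ops))

    def get(self, index):
        # Walk back to the nearest full snapshot, then replay diffs forward.
        base = index
        while self.revisions[base][0] != "full":
            base -= 1
        text = self.revisions[base][1]
        for rev in range(base + 1, index + 1):
            ops = self.revisions[rev][1]
            text = "".join(
                text[i1:i2] if tag == "equal" else payload
                for tag, i1, i2, payload in ops
            )
        return text

store = RevisionStore(key_interval=3)
for version in ["Hello world", "Hello brave world", "Hello brave new world"]:
    store.save(version)
print(store.get(2))  # -> Hello brave new world
```

Lookup cost grows with the distance from the last full snapshot, which is exactly why the periodic full copies are worth storing: they cap the number of diffs that ever need replaying, and corruption of one diff can only damage history up to the next keyedit.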