
Problem populating new database
Hello Wikipedians,

I am in the process of making a local mirror of the Wikipedia encyclopedia
and seem to have hit a stumbling block. But first, a related question. I
have looked through the software documentation and the downloaded pages and
haven't seen this addressed, but I just want to make sure. I have the base
software, version 1.2.3, installed from the web interface. I had only one
small but surmountable problem: when using IE 5.1.3 under Mac OS 9.0.4, I
could not enter the name for the site. The field was overlaid by the
information that should have been to its right. I switched to Netscape and
then had no problem.

Now I am in the process of populating the database and was wondering
whether the maintenance folder (or someplace else) contains a set of
scripts to fetch and load the actual base data content and then the weekly
updates. I would like to keep this mirror up to date with the master copy.
From looking at the mailing list archive I have seen it stated that there
is no documentation file explaining what each of the maintenance scripts
does, and looking them over hasn't turned up one that creates or updates
the database. If one doesn't exist I am ready to do it manually, but in my
first attempts I have hit a few problems.
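
For what it's worth, the manual procedure I have in mind is roughly the
following sketch (the variable values are placeholders for my own setup,
not anything I found in the maintenance folder):

========= start of clip
#!/bin/sh
# Rough sketch of a manual refresh: fetch the latest "cur" dump and load it.
DATE=20040403        # dump date -- placeholder
DBUSER=xxxxxxx       # MySQL user -- placeholder
DBNAME=wikipedia     # database the wiki was installed into

# Fetch the compressed dump from the download server.
wget http://download.wikimedia.org/archives/en/${DATE}_cur_table.sql.bz2

# Decompress and feed it straight into MySQL (prompts for the password).
bzip2 -dc ${DATE}_cur_table.sql.bz2 | mysql -u${DBUSER} -p ${DBNAME}
========= end of clip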

The first trick is getting the correct data to do the upload with. I found
the dump download page and the files for the EN version of the database
(dated 2004-04-03). The current file looks fine and I have been able to
retrieve it and do some (not all) processing with it. My first problem is
the old database. My assumption is that it contains the full database
content (minus images) prior to the new data in the current file. I notice
that the format of the old/full file has changed recently and that it has
grown a lot. I tried to download the full database as a single file and
failed (403 - not authorized), which does not seem unexpected, since there
is mention of multi-part files for those experiencing problems. The single
files
http://download.wikimedia.org/archives/en/20040403_cur_table.sql.bz2 and
http://download.wikimedia.org/archives/en/20040403_old_table.sql.bz2 have
names and formats that make sense to me. The partials have me confused,
especially given my inability to decompress them. For one thing, only
three files are listed, and based on the file sizes (plus one unlisted
file) it appears there should be four: the first three come to exactly
2 GB each, a mathematical oddity if that were all there was, but a fourth
file would even things out nicely. The files themselves also have names
that give no clue as to their contents:
http://download.wikimedia.org/archives/en/xaa , xab, xac, and the unlisted
xad. What format are these, and how should they be joined together? I
copied them over via wget and then tried to merge and decompress them, but
failed. The command and response I tried (to verify the files before
actual processing) were:

========= start of clip
-bash-2.05b$ nice bzip2 -t xaa xab xac xad
bzip2: xaa: file ends unexpectedly
bzip2: xab: bad magic number (file not created by bzip2)
bzip2: xac: bad magic number (file not created by bzip2)
bzip2: xad: bad magic number (file not created by bzip2)

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
========= end of clip

Are these files damaged, or am I just using the wrong software to handle
them? By the way, I am on a Red Hat 9 system running PHP 4.3.4, MySQL
4.0.18, and Apache 2.

To try to continue my testing and make sure I had everything else in
place, I thought I'd try using the current file and see how that went. It
might not be all the data, but it would give me a taste of how things were
going. The decompression went fine, but I hit a problem partway through
the load. The data that did load was enough for me to do some minimal
testing and verify that the software basically works and that I was close
on the upload procedure. The command I tried and the response I got were:

========= start of clip
-bash-2.05b$ nice mysql -p -uxxxxxxx wikipedia < 20040403_cur_table.sql
Enter password:
ERROR 1153 at line 831: Got a packet bigger than 'max_allowed_packet'
-bash-2.05b$
========= end of clip

What size should I be setting the 'max_allowed_packet' to?

Thanks in advance for your help and for creating this software and its
associated database.

Paul

http://PrivacyDigest.com/ Daily news from the privacy front.

PS: The ls -al listing for the data files I downloaded is:

-rw-r--r--  1 wikipedia psacln  850374900 Apr  3 02:09 20040403_cur_table.sql
-rw-r--r--  1 wikipedia psacln 2000000000 Mar 22 17:32 xaa
-rw-r--r--  1 wikipedia psacln 2000000000 Mar 22 17:35 xab
-rw-r--r--  1 wikipedia psacln 2000000000 Mar 22 17:38 xac
-rw-r--r--  1 wikipedia psacln 1614740369 Mar 22 17:40 xad
Re: Problem populating new database
On Apr 7, 2004, at 21:11, Paul Hardwick wrote:
> I had only one small but surmountable problem: when using IE 5.1.3
> under Mac OS 9.0.4, I could not enter the name for the site. The field
> was overlaid by the information that should have been to its right. I
> switched to Netscape and then had no problem.

I'll check this out, thanks for the note.

> Now I am in the process of populating the database and was wondering
> whether the maintenance folder (or someplace else) contains a set of
> scripts to fetch and load the actual base data content and then the
> weekly updates. I would like to keep this mirror up to date with the
> master copy.

No, there is no such script. Unfortunately we don't yet have a good
procedure for synchronizing a mirror other than throwing out and
replacing the whole thing every week or so.

Just note, INSTALL THE WIKI FIRST, then load in the data. The dumps
*drop* the existing tables and replace them, and the install doesn't
like to run over a partial set of tables. (Command-line install will
drop any existing tables.)
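
If you're not sure whether the database already holds a partial set of
tables before loading, a quick check (using the same placeholder user and
database names as the commands below) is:

mysql -u mywikiuser -p mydatabase -e "SHOW TABLES;"

Anything left over from a partial install will show up there.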

> The partials have me confused,

First, the bad news. The partials weren't being updated automatically
by the backup process, so what you downloaded was about a month old. If
you want the April 3 backup, you'll have to grab them again. Sorry...
:(

Also, the split files are up to xae now. Compression of old revisions
reduces the raw disk space (& disk cache) needed for the table, but
totally ruins the compression ratio of the downloadable dumps.
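
If you want to re-fetch the whole current set in one go, a simple loop
over the part names should do it (adjust the list if it changes again):

for part in xaa xab xac xad xae; do
    wget http://download.wikimedia.org/archives/en/$part
done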

> -bash-2.05b$ nice bzip2 -t xaa xab xac xad
> bzip2: xaa: file ends unexpectedly
> bzip2: xab: bad magic number (file not created by bzip2)

That will try to decompress each file in turn, which doesn't work; you
need to concatenate them back into a single stream. The simplest thing
might be to pipe it straight into mysql, assuming you're already set
up:

cat xa? | bzip2 -dc | mysql -u mywikiuser -p mydatabase

Or if you'd like to output a big decompressed SQL file:

cat xa? | bzip2 -dc > old_table_20040403.sql
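
Before committing to a long import, you can also sanity-check the
concatenated stream by decompressing it to nowhere; if this runs to the
end without complaints, the parts you fetched are complete:

cat xa? | bzip2 -dc > /dev/null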

> -bash-2.05b$ nice mysql -p -uxxxxxxx wikipedia < 20040403_cur_table.sql
> Enter password:
> ERROR 1153 at line 831: Got a packet bigger than 'max_allowed_packet'
> -bash-2.05b$
>
> What size should I be setting the 'max_allowed_packet' to?

I think 16MB is the maximum; try that.
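
For loading a dump, it's the server's setting that matters (error 1153 is
the server refusing an over-sized query packet). One way to raise it,
assuming a stock my.cnf, is to add this to the [mysqld] section and
restart mysqld:

[mysqld]
max_allowed_packet = 16M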

-- brion vibber (brion @ pobox.com)