Mailing List Archive

MySQL 4.1 & MediaWiki backup corruption warning
Just a note of warning for those of you using MySQL 4.1: the new
character set handling may cause mysqldump to write corrupted data into
backups, which then can't be restored without data loss.*

This may affect some Unicode text, and it can certainly corrupt
compressed old revision text beyond repair (text stored with the
$wgCompressRevisions option enabled). If you're using MySQL 4.1, you
should examine and test your backup dumps to make sure they can be
restored and used successfully.
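
One way to test a restored dump is to compare the raw bytes of each
row's text before and after the restore. The Python sketch below is
only an illustration: the function name and the idea of holding rows
as an id-to-bytes mapping are assumptions made for this example, not
anything MediaWiki provides, and fetching the rows from the two
databases is left out.

```python
def find_damaged_rows(original, restored):
    """Return the sorted ids whose bytes changed or vanished in a restore.

    Both arguments are hypothetical dicts mapping a row id to the raw
    bytes of that row's text column, read from the live database and
    from a scratch database loaded from the dump.
    """
    damaged = []
    for row_id, blob in original.items():
        # A missing row and a row with different bytes both count as damage.
        if restored.get(row_id) != blob:
            damaged.append(row_id)
    return sorted(damaged)
```

An empty result only says the rows that were compared came back
byte-for-byte identical; it is not a substitute for actually using
the restored wiki.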

Passing an option like --default-character-set=latin1 may stop
mysqldump from trying to 'convert' (and thus corrupt) your data. (If
your server is not set to the defaults, this may or may not be the
correct value for you.) In the future hopefully we'll be able to play
nicer with the new character set settings, but for now MediaWiki
follows the practice established for older versions of MySQL, which had
(and still have) no way to correctly indicate the charset used in a
particular database, table, or field.
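
For anyone scripting their backups, here is a rough Python sketch of
passing that option. The function name, the database name "wikidb"
and the output file name are placeholders invented for the example,
and it assumes credentials come from a MySQL option file; "latin1" is
only right if your server uses the defaults, as noted above.

```python
def build_dump_command(database, outfile, charset="latin1"):
    """Return a mysqldump argument list with an explicit character set."""
    return [
        "mysqldump",
        # Ask mysqldump to treat the data as this charset.
        "--default-character-set=" + charset,
        # Write the dump to a file instead of standard output.
        "--result-file=" + outfile,
        database,
    ]


if __name__ == "__main__":
    # Print the command rather than running it; to run it for real,
    # pass the list to subprocess.run(..., check=True).
    print(" ".join(build_dump_command("wikidb", "wikidb.sql")))
```

Whatever options you end up with, the dump still needs to be restored
somewhere and checked before you rely on it.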

* Specifically, a default "latin-1" to UTF-8 conversion silently
corrupts all bytes with the values 0x81, 0x8d, 0x8f, 0x90, or 0x9d by
turning them into literal question marks. The question marks cannot be
returned to their original byte values when the data is re-imported.
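
To make the footnote concrete, this small Python sketch imitates the
damage on a byte string. It does not call MySQL at all; it simply
replaces the five byte values listed above with "?", and the function
names are made up for the example.

```python
# Byte values the footnote lists as being turned into "?".
LOSSY_BYTES = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])

# Translation table: every byte maps to itself except the five above,
# which map to 0x3F, the ASCII question mark.
_TABLE = bytes(0x3F if b in LOSSY_BYTES else b for b in range(256))


def simulate_conversion(data):
    """Imitate the corruption by replacing each affected byte with b'?'."""
    return data.translate(_TABLE)


def count_at_risk(data):
    """Count the bytes in data that the conversion would destroy."""
    return sum(data.count(bytes([b])) for b in LOSSY_BYTES)
```

Both simulate_conversion(b"\x81") and simulate_conversion(b"\x8d")
come out as b"?", so nothing in the dump records which byte was
originally there. For compressed text, where every byte matters, that
is enough to make the revision unreadable.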

-- brion vibber (brion @ pobox.com)