Mailing List Archive

UTF-8 Conversion
So, Wikitravel uses the default ISO-8859-1 encoding, and we're
starting to need UTF-8 characters.

I think the process for converting might be as follows:

1. Shut down the site.
2. Back up the database with a data dump.
3. iconv the dump file to UTF-8.
4. Twiddle with mysql till it knows to use UTF-8.
5. Twiddle with PHP's php.ini till it knows to use UTF-8.
6. Change the encoding in LocalSettings.php to use UTF-8.
7. Delete everything in the database.
8. Import the data dump back in.
9. Turn the site back on.
10. Hope for the best.

Does this sound about right? Have any other MediaWiki installations
switched encodings midstream like this?
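
Step 3 does not strictly need iconv itself. The following is a minimal Python sketch of the same transcoding, assuming the dump is a plain-text file encoded as ISO-8859-1. The function name and file paths are hypothetical, and the sketch does nothing special for binary columns.

```python
def transcode_dump(src_path, dst_path, chunk_chars=1 << 20):
    """Copy a text dump from ISO-8859-1 to UTF-8, one chunk at a time."""
    # newline="" turns off newline translation, so line endings in the
    # dump pass through unchanged.
    with open(src_path, "r", encoding="iso-8859-1", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    transcode_dump("dump-latin1.sql", "dump-utf8.sql")
```

Reading in chunks keeps memory use flat even for a large dump.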

~ESP

--
Evan Prodromou <evan@wikitravel.org>
Wikitravel - http://www.wikitravel.org/
The free, complete, up-to-date and reliable world-wide travel guide

Re: UTF-8 Conversion
>>>>> "EP" == Evan Prodromou <evan@wikitravel.org> writes:

EP> So, Wikitravel uses the default ISO-8859-1 encoding, and we're
EP> starting to need UTF-8 characters.

EP> I think the process for converting might be as follows:

EP> 3. iconv the dump file to UTF-8. [...]
EP> 8. Import the data dump back in.

OK, so, this won't work. linkscc has binary data in it that mucks up
the dump file.

I think maybe skipping the linkscc table, and rebuilding it
afterwards, might work.
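
To illustrate why binary data and a text-level conversion do not mix, here is a small Python sketch. It assumes the binary column holds compressed bytes and uses a gzip blob as a stand-in. The helper names are hypothetical. Every byte at or above 0x80 becomes two bytes in UTF-8, so the blob no longer decompresses.

```python
import gzip
import zlib


def transcode_bytes(raw):
    """Treat raw bytes as ISO-8859-1 text and re-encode them as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


def survives_transcoding(blob):
    """Return True if a gzip blob still decompresses after transcoding."""
    try:
        gzip.decompress(transcode_bytes(blob))
        return True
    except (OSError, EOFError, zlib.error):
        return False


if __name__ == "__main__":
    blob = gzip.compress(b"[[Paris]] [[London]]")  # stand-in for a cached row
    print(survives_transcoding(blob))
```

Plain ASCII passes through such a conversion untouched, which is why only the binary columns are a problem.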

~ESP


Re: UTF-8 Conversion
On Dec 6, 2003, at 14:05, Evan Prodromou wrote:
> EP> 3. iconv the dump file to UTF-8. [...]
> EP> 8. Import the data dump back in.
>
> OK, so, this won't work. linkscc has binary data in it that mucks up
> the dump file.
>
> I think maybe skipping the linkscc table, and rebuilding it
> afterwards, might work.

The only tables that should contain binary data at present are linkscc
(gzipped data) and math (a binary hash value). Both of these tables'
contents are volatile: you can just clear them out, and they will be
regenerated when needed.
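
Since both tables are volatile, one option is to leave them out of the dump entirely. Below is a hedged Python sketch that builds such a mysqldump command line. The database name and output path are hypothetical, credentials are omitted, and it assumes the installed mysqldump supports `--ignore-table` (older releases may not).

```python
def dump_command(database, skip_tables, out_path):
    """Build a mysqldump argument list that leaves out the given tables."""
    cmd = ["mysqldump", "--result-file=" + out_path]
    for table in skip_tables:
        # --ignore-table takes db_name.tbl_name and is repeated per table.
        cmd.append("--ignore-table={}.{}".format(database, table))
    cmd.append(database)
    return cmd


if __name__ == "__main__":
    # "wikidb" and "dump.sql" are placeholders, not real names.
    print(" ".join(dump_command("wikidb", ["linkscc", "math"], "dump.sql")))
```

The resulting list can be handed to `subprocess.run` once the connection options for the actual server are added.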

-- brion vibber (brion @ pobox.com)

Re: UTF-8 Conversion
>>>>> "BV" == Brion Vibber <brion@pobox.com> writes:

BV> The only tables that should contain binary data at present are
BV> linkscc (gzipped data) and math (a binary hash value). Both of
BV> these tables' contents are volatile: you can just clear them
BV> out, and they will be regenerated when needed.

So, I think my new strategy is this:

1. Shut down the site.
2. Back up these tables with a data dump:

* archive
* cur
* image
* interwiki
* ipblocks
* old
* oldimage
* user
* user_newtalk
* watchlist

3. iconv the dump file to UTF-8.
4. Twiddle with mysql till it knows to use UTF-8.
5. Twiddle with PHP's php.ini till it knows to use UTF-8.
6. Change the encoding in LocalSettings.php to use UTF-8.
7. Reinstall MediaWiki, wiping the DB.
8. Import the data dump back in.
9. Rebuild links and RC tables with rebuildall.php.
10. Turn the site back on.
11. Hope for the best.

I guess at step 1 I could just lock the database, rebuild in a new
DB, and at step 10 change LocalSettings.php so it points to the new DB.

That might work.
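
One way to make the "hope for the best" step a little more checkable is to confirm, before importing, that the converted dump really is well-formed UTF-8. This is only a sketch. The function name and the file name are hypothetical, and passing the check does not prove the text was converted correctly, only that the bytes decode.

```python
import codecs


def is_valid_utf8(path, chunk_size=1 << 16):
    """Return True if the whole file decodes as UTF-8."""
    # An incremental decoder copes with multi-byte characters that
    # happen to be split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
        # final=True reports a character left incomplete at end of file.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("dump-utf8.sql"))  # hypothetical file name
```

A pure-ASCII file passes this check in either encoding, so it is most informative on dumps that actually contain accented characters.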

~ESP
