Mailing List Archive

php-wiki de
I have set up the test site at http://test-de.wikipedia.com/. I had to
remove an INSERT statement from newiki.sql because it was causing an error.
Of course, I lost the exact text of the error somewhere, but it
basically said that there was already an entry for Main_Page (the
entry I removed looked like it may have been a discussion forum for
the actual main page). However, it was a fairly large chunk. So,
take a look.

A diff to my modified newiki.sql is attached.

--
"Jason C. Richey" <jasonr@bomis.com>
Re: php-wiki de
> I have set up the test site at http://test-de.wikipedia.com/,

Thanks Jason! I informed the German Wikipedians and hope they'll start
playing around with it soon.

I wrote down some bugs I found on
http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler

But that's in German, and I don't know whether anybody but Magnus will
understand it. So shall I post a (badly) translated report to wikitech-l
from time to time?

* The search for words with umlauts doesn't seem to work. The script
doesn't find them, and when I search for more than one word (including
one with umlauts) I get MySQL errors.

* The conversion didn't correct the upper and lower case problem in
article titles. I thought someone made a script for this?

* And can't we prevent the loss of article histories when converting to
the new format?

BTW, maybe it's possible to keep an unlimited revision history (and
import the old ones) when the script is faster again, or when we change
to the new server.

More complaints coming soon ... ;-)

Kurt
Re: [Intlwiki-l] Re: php-wiki de
On mar, 2002-05-21 at 06:28, Kurt Jansson wrote:
> > I have set up the test site at http://test-de.wikipedia.com/,
>
> Thanks Jason! I informed the German Wikipedians and hope they'll start
> playing around with it soon.
>
> I wrote down some bugs I found on
> http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler
>
> But that's in German, and I don't know whether anybody but Magnus will
> understand it. So shall I post a (badly) translated report to wikitech-l
> from time to time?
>
> * The search for words with umlauts doesn't seem to work. The script
> doesn't find them,

Hmm, looks like the search needs another drubbing. I believe it's
chopping up words at the boundaries of non-ASCII characters; if your
word is big enough on either end (übersetzung, terroranschläge) this
does a good enough job anyway, but it's hardly the way it should work!
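
To illustrate (a sketch, not the actual search code; with the default C
locale, \w matches only ASCII):

<?php
// Sketch: split a query on runs of \w, which is effectively what
// the search parser does. Non-ASCII bytes act as word boundaries.
$query = "Terroranschläge Fläche";
preg_match_all( '/\w+/', $query, $m );
print_r( $m[0] );
// -> Terroranschl, ge, Fl, che
// "Terroranschl" still clears the 4-character minimum, so that
// search limps along; "Fläche" chops to "Fl" and "che" and is lost.
?>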

I'll try it on my own machine later today; if it works there, I'll send
some MySQL character set config tweaks I've made that may help.

> and when I search for more than one word (including
> one with umlauts) I get MySQL errors.

I'm not getting that, this seems to work for me. (Try for example
"terroranschläge staaten".)

Note of course that the dreaded 4-character minimum is still in place.

> * The conversion didn't correct the upper and lower case problem in
> article titles. I thought someone made a script for this?

Working on it...

> * And can't we prevent the loss of article histories when converting to
> the new format?
>
> BTW, maybe it's possible to keep an unlimited revision history (and
> import the old ones) when the script is faster again, or when we change
> to the new server.

We've already been doing this for ages. (But nobody's yet run the recovery
script which will add these to the *English* version, which was converted
with an older, more primitive conversion script.)

Which articles are missing histories? I checked a few at random and so
far, so good.

> More complaints coming soon ... ;-)

Vielen Dank!

-- brion vibber (brion @ pobox.com)
Re: [Intlwiki-l] Re: php-wiki de
On Tue, May 21, 2002 at 10:38:18AM -0700, Brion L. VIBBER wrote:
> On mar, 2002-05-21 at 06:28, Kurt Jansson wrote:
> > > I have set up the test site at http://test-de.wikipedia.com/,
> >
> > Thanks Jason! I informed the German Wikipedians and hope they'll start
> > playing around with it soon.
> >
> > I wrote down some bugs I found on
> > http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler
> >
> > But that's in German, and I don't know whether anybody but Magnus will
> > understand it. So shall I post a (badly) translated report to wikitech-l
> > from time to time?
> >
> > * The search for words with umlauts doesn't seem to work. The script
> > doesn't find them,
>
> Hmm, looks like the search needs another drubbing. I believe it's
> chopping up words at the boundaries of non-ASCII characters; if your
> word is big enough on either end (übersetzung, terroranschläge) this
> does a good enough job anyway, but it's hardly the way it should work!

It doesn't (unless you have changed this in my code; I didn't check). The
parsing function uses the Perl regular expression \w to decide whether
something is legal in a search word or not. If it thinks a character is
illegal, it gives an error, so you should have noticed that. If the result is
empty and you didn't get a syntax error, there can be two problems:
- MySQL doesn't index that character. I wasn't able to find out exactly which
characters are indexed by the full-text index and which are not. I know that
characters with umlauts get indexed (search for "Gödel", for example), but I
wouldn't know about the ringel-S (sp?), for example. The easiest way to find
out is probably simply to try.
- The special characters in the articles were written using entities. The
search doesn't know that ö and &ouml; are the same.
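
If we wanted to fix that second problem, the indexer could decode entities
before extracting words. A rough sketch (untested; where this hook would
live in the indexing code is up to whoever implements it, and it ignores
numeric references like &#246;):

<?php
// Sketch: decode named HTML entities so that &ouml; and a literal ö
// produce the same index word. get_html_translation_table() has been
// in PHP since 4.0. Note it assumes Latin-1 text.
function decode_entities( $text ) {
    $table = array_flip( get_html_translation_table( HTML_ENTITIES ) );
    return strtr( $text, $table );
}

echo decode_entities( "Schr&ouml;der" );  // -> Schröder
?>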

Btw, if you have questions about the parse function for the search, just
ask. I'm very busy at the moment so I don't have time to do any real
programming, but answering a few questions should not be a problem.

-- Jan Hidders
Re: [Intlwiki-l] Re: php-wiki de
>On mar, 2002-05-21 at 06:28, Kurt Jansson wrote:
>> > I have set up the test site at http://test-de.wikipedia.com/,
>>
>> Thanks Jason! I informed the German Wikipedians and hope they'll start
>> playing around with it soon.
>>
>> I wrote down some bugs I found on
>> http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler
>>
>> But that's in German, and I don't know whether anybody but Magnus will
>> understand it. So shall I post a (badly) translated report to wikitech-l
>> from time to time?

The list is growing! Brion, how's your German? :-)


> > and when I search for more than one word (including
> > one with umlauts) I get MySQL errors.
>
> I'm not getting that, this seems to work for me. (Try for example
> "terroranschläge staaten".)
>
> Note of course that the dreaded 4-character minimum is still in place.

Okay, my word fell under the minimum because of the chopping. "Berlin
Fläche" gives me

Warning: Supplied argument is not a valid MySQL result resource in
/home/wiki-de-php/fpw/special_dosearch.php on line 324

Warning: Supplied argument is not a valid MySQL result resource in
/home/wiki-de-php/fpw/special_dosearch.php on line 326

Warning: Supplied argument is not a valid MySQL result resource in
/home/wiki-de-php/fpw/special_dosearch.php on line 334

Warning: Supplied argument is not a valid MySQL result resource in
/home/wiki-de-php/fpw/special_dosearch.php on line 336

> Which articles are missing histories? I checked a few at random and so
> far, so good.

Sorry, forget about this. I was talking nonsense.

> > More complaints coming soon ... ;-)
>
> Vielen Dank!

And, sorry, I forgot to mention that the new script all in all REALLY
KICKS ASS!!! I love it.

Kurt
Re: [Intlwiki-l] Re: php-wiki de
On mer, 2002-05-22 at 10:36, Kurt Jansson wrote:
> >> I wrote down some bugs I found on
> >> http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler
> >>
> >> But that's in German, and I don't know whether anybody but Magnus will
> >> understand it. So shall I post a (badly) translated report to wikitech-l
> >> from time to time?
>
> The list is growing! Brion, how's your German? :-)

Nicht so gut, but I'll try to muddle through the list mit mein
Wörterbuch und verstehen ein bisse...

> Okay, my word fell under the minimum because of the chopping. "Berlin
> Fläche" gives me
>
> Warning: Supplied argument is not a valid MySQL result resource in
> /home/wiki-de-php/fpw/special_dosearch.php on line 324

If I actually make the software check return codes and report useful
error messages, I see:

You have an error in your SQL syntax near 'MATCH (cur_ind_title)
AGAINST ("Fläche") ) AND cur_t' at line 3

Looks like mismatched parentheses. I've replaced (what I think is) the
rest of the \w regexps with [\w\x80-\xff], and it looks improved, but as
much of the search code is a mystery to me, I can't guarantee anything!
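
For the curious, the change amounts to this (variable names illustrative;
the real edits are scattered through special_dosearch.php):

<?php
$searchterm = "Berlin Fläche";

// Before: \w matches only ASCII word characters, so the high bytes
// of "Fläche" split it into "Fl" and "che".
preg_match_all( '/\w+/', $searchterm, $words );

// After: treat bytes 0x80-0xff as word characters too, so UTF-8
// multibyte sequences stay inside a single word.
preg_match_all( '/[\w\x80-\xff]+/', $searchterm, $words );
// $words[0] is now array( "Berlin", "Fläche" )
?>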

(Some changes to special_dosearch.php now in CVS.)

> > Which articles are missing histories? I checked a few at random and so
> > far, so good.
>
> Sorry, forget about this. I was talking nonsense.

(Whew!)

> > > More complaints coming soon ... ;-)
> >
> > Vielen Dank!
>
> And, sorry, I forgot to mention that the new script all in all REALLY
> KICKS ASS!!! I love it.

Oh, that's so much nicer than "it's too slow!" or "that's a well-meaning
feature but poorly implemented" or "make sure everything works 100%
before putting it online". You're my new favorite user! ;)

-- brion vibber (brion @ pobox.com)
Re: [Intlwiki-l] Re: php-wiki de
On Wed, May 22, 2002 at 11:14:06AM -0700, Brion L. VIBBER wrote:
>
> If I actually make the software check return codes and report useful
> error messages, I see:
>
> You have an error in your SQL syntax near 'MATCH (cur_ind_title)
> AGAINST ("Fläche") ) AND cur_t' at line 3
>
> Looks like mismatched parentheses. I've replaced (what I think is) the
> rest of the \w regexps with [\w\x80-\xff], and it looks improved, but as
> much of the search code is a mystery to me, I can't guarantee anything!

It looks OK to me. Having said that, can I, as one of the contributors to
the search code, make a small protest here? [But feel free to ignore me,
because I've been away for too long without due notice.] I checked what \w
actually matches on my system here, and it matches the following character
codes (decimal):

48 - 57 ( '0' - '9' )
65 - 90 ( 'A' - 'Z' )
95 ( '_' )
97 - 122 ( 'a' - 'z' )
170
181
192 - 214
216 - 246
248 - 255

So that includes all the German characters (or has our encoding scheme
changed?), and it should have worked with simply \w (and it did, as I said:
searching for "Gödel" went fine). This is important because (1) our error
reporting should be as tight as possible and not just give an empty search
result if the user types a character that isn't indexed, and (2) if something
really did go wrong before, then your quick fix is now hiding a bug that may
come back to haunt us later.
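
(For reference, this is roughly how I checked it; just walk the byte
values:)

<?php
// Enumerate which single-byte characters \w matches under the
// current locale. This produced the list above.
for ( $i = 0; $i < 256; $i++ ) {
    if ( preg_match( '/^\w$/', chr( $i ) ) ) {
        echo $i . "\t" . chr( $i ) . "\n";
    }
}
?>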

-- Jan Hidders
Re: [Intlwiki-l] Re: php-wiki de
On mer, 2002-05-22 at 14:43, Jan.Hidders wrote:
> On Wed, May 22, 2002 at 11:14:06AM -0700, Brion L. VIBBER wrote:
> >
> > If I actually make the software check return codes and report useful
> > error messages, I see:
> >
> > You have an error in your SQL syntax near 'MATCH (cur_ind_title)
> > AGAINST ("Fläche") ) AND cur_t' at line 3
> >
> > Looks like mismatched parentheses. I've replaced (what I think is) the
> > rest of the \w regexps with [\w\x80-\xff], and it looks improved, but as
> > much of the search code is a mystery to me, I can't guarantee anything!
>
> It looks OK to me. Having said that, can I, as one of the contributors to
> the search code, make a small protest here? [But feel free to ignore me,
> because I've been away for too long without due notice.] I checked what \w
> actually matches on my system here, and it matches the following character
> codes (decimal):
>
> 48 - 57 ( '0' - '9' )
> 65 - 90 ( 'A' - 'Z' )
> 95 ( '_' )
> 97 - 122 ( 'a' - 'z' )
> 170
> 181
> 192 - 214
> 216 - 246
> 248 - 255
>
> So that includes all the German characters (or has our encoding scheme
> changed?)

If you'll recall, we're switching all the wikipedias to UTF-8 (if we
don't do it now, we'll just end up doing it in a few years and it'll be
more painful). The above covers some, but not all, of the UTF-8 sequences
that encode valid letters.

In an ideal world, locale settings would apply (and work correctly and
consistently!) and \w would match everything necessary, but... (PHP
4.1.0 has some sort of special UTF-8 mode for regexps that might or
might not be useful here.)
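
(If anyone wants to experiment: the /u modifier switches PCRE into UTF-8
mode, and \p{L} matches any Unicode letter. Whether the PCRE bundled with
our PHP is new enough to support \p{L} is exactly the sort of thing that
needs checking first; untested:)

<?php
// Sketch: UTF-8 aware word matching. In PCRE's UTF-8 mode (/u),
// \p{L} is any letter and \p{N} any digit.
$searchterm = "Berlin Fläche";
preg_match_all( '/[\p{L}\p{N}_]+/u', $searchterm, $words );
// "Fläche" stays together as one word instead of being split
// at the non-ASCII bytes.
?>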

> and it should have worked with simply \w (and it did, as I said:
> searching for "Gödel" went fine). This is important because (1) our error
> reporting should be as tight as possible and not just give an empty search
> result if the user types a character that isn't indexed, and (2) if something
> really did go wrong before, then your quick fix is now hiding a bug that may
> come back to haunt us later.

Oh, I don't doubt that at all. :)

-- brion vibber (brion @ pobox.com)
Re: [Intlwiki-l] Re: php-wiki de
On mar, 2002-05-21 at 06:28, Kurt Jansson wrote:
> * The conversion didn't correct the upper and lower case problem in
> article titles. I thought someone made a script for this?

Okay, I finally got a chance to finish that up. I've just checked in
recaseLinks.php, which looks through the link tables with the critical
eye of the old capitalization function, counts up the various alternate
name forms in use for each page, and renames articles to the form most
frequently used in links. It also puts in redirects for other forms that
were used, so there shouldn't be breakage.
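
The core of it is just tallying spellings, something like this (schematic
only; the table and column names are illustrative rather than the real
schema, and the real script does a lot more bookkeeping):

<?php
// Schematic: collect every spelling used in links, bucket the
// spellings by case-folded title, and pick the most frequent
// spelling in each bucket.
$res = mysql_query( "SELECT l_to, COUNT(*) AS n FROM links GROUP BY l_to" );
$best = array();
while ( $row = mysql_fetch_object( $res ) ) {
    $key = strtolower( $row->l_to );
    if ( !isset( $best[$key] ) || $row->n > $best[$key][1] ) {
        $best[$key] = array( $row->l_to, $row->n );
    }
}
// For each bucket, $best[$key][0] is the form to rename the article
// to; the other forms get redirects pointing at it.
?>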

Jason, if you could give this a try on the German test site:
php recaseLinks.php
php rebuildLinks.php # The link tables then need to be rebuilt

(The Esperanto wiki also needs it, but it still needs to be converted
again to see if the alternate charset function works as I think it
should. I'd be mucho grateful if you could do that, so I can set people
loose on bug & message-translation checking!)

-- brion vibber (brion @ pobox.com)
Re: [Intlwiki-l] Re: php-wiki de
On Wed, May 22, 2002 at 03:34:08PM -0700, Brion L. VIBBER wrote:
>
> If you'll recall, we're switching all the wikipedias to UTF-8 (if we don't
> do it now, we'll just end up doing it in a few years and it'll be more
> painful).

Uh oh. I remember we had that discussion and as I already said then: moving
to UTF-8 breaks the fulltext search. This is because the indexing algorithm
assumes we use Latin-1, and bases on that its decisions about collation and
about which characters to index. For the German characters this will
probably more or less work out, but if you go beyond that (why else use
UTF-8?) you will get into severe trouble.
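
To make the problem concrete (a quick illustration; assume the string is
UTF-8 encoded):

<?php
// The UTF-8 encoding of 'ü' is the byte pair 0xC3 0xBC. A Latin-1
// based indexer sees 'Ã' (a letter) followed by '¼' (not a letter),
// so the word is broken in two at the second byte.
$word = "über";
for ( $i = 0; $i < strlen( $word ); $i++ ) {
    printf( "%02X ", ord( $word[$i] ) );
}
// -> C3 BC 62 65 72
?>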

Do you realize how serious this situation is?

> In an ideal world, locale settings would apply (and work correctly and
> consistently!) and \w would match everything necessary, but... (PHP 4.1.0
> has some sort of special UTF-8 mode for regexps that might or might not be
> useful here.)

It's not so much PHP that is the problem, as it is MySQL. Perhaps we should
consider moving to PostgreSQL, which really supports UTF-8 and is a better
database anyway (some special pages could be implemented far more
efficiently there).

-- Jan Hidders
Re: [Intlwiki-l] Re: php-wiki de
On ĵaŭ, 2002-05-23 at 02:21, Jan.Hidders wrote:
> On Wed, May 22, 2002 at 03:34:08PM -0700, Brion L. VIBBER wrote:
> >
> > If you'll recall, we're switching all the wikipedias to UTF-8 (if we don't
> > do it now, we'll just end up doing it in a few years and it'll be more
> > painful).
>
> Uh oh. I remember we had that discussion and as I already said then: moving
> to UTF-8 breaks the fulltext search.

Fulltext search is already broken in a million ways! It doesn't know
character references (&uuml;, &#x109;, &#303; etc), it can't find
partial matches or sounds-likes, it can't find "X" when you search for
"Xs" or "Xs" when you search for "X", it doesn't return *ANY* results
for words it thinks are too common...

UTF-8 is the least of our problems; it just means that case-folding is a
little trickier (and if we had a decent $*#@%# database, it would take
care of that for us).

> This is because the indexing algorithm
> assumes we use Latin-1, and bases on that its decisions about collation and
> about which characters to index. For the German characters this will
> probably more or less work out, but if you go beyond that (why else use
> UTF-8?) you will get into severe trouble.
>
> Do you realize how serious this situation is?

Yes, that's *exactly* why we have to break out of the "everything in the
world should be Latin-1, oh and by the way even though we have limited
support for other character sets -- but not the ones YOU need -- you can
only select one single character set for the whole database server!
MWOOHAAHAAHAAA!" rut and make it work for the rest of us too.

> > In an ideal world, locale settings would apply (and work correctly and
> > consistently!) and \w would match everything necessary, but... (PHP 4.1.0
> > has some sort of special UTF-8 mode for regexps that might or might not be
> > useful here.)
>
> It's not so much PHP that is the problem, as it is MySQL.

The particular problem I was discussing was the regexps, which were a
PHP problem. MySQL could do all the magic it wanted; if the words don't
get through the regexps in the PHP code they'll never get anywhere in
the database's fulltext search.

> Perhaps we should
> consider moving to PostgreSQL, which really supports UTF-8 and is a better
> database anyway (some special pages could be implemented far more
> efficiently there).

Oh, it's not like that hasn't been suggested. If anybody knows how to go
about switching to Postgres, I sure as heck wouldn't object. I have no
emotional attachment to MySQL; as far as I know it's only being used
because Magnus was already familiar with it.

-- brion vibber (brion @ pobox.com)
Re: [Intlwiki-l] Re: php-wiki de
On Thu, May 23, 2002 at 03:42:59AM -0700, Brion L. VIBBER wrote:
> On ĵaŭ, 2002-05-23 at 02:21, Jan.Hidders wrote:
> >
> > Uh oh. I remember we had that discussion and as I already said then:
> > moving to UTF-8 breaks the fulltext search.
>
> Fulltext search is already broken in a million ways! It doesn't know
> character references (&uuml;, &#x109;, &#303; etc), it can't find
> partial matches or sounds-likes, it can't find "X" when you search for
> "Xs" or "Xs" when you search for "X", it doesn't return *ANY* results
> for words it thinks are too common...

All true, although the last problem will be solved when we move to the new
MySQL and use the boolean search there. But at least the semantics were
well-defined: if a word shows up in the edit text, the search will find it.
That is now no longer the case, and no clever PHP programming can solve it.

> UTF-8 is the least of our problems; it just means that case-folding is a
> little trickier (and if we had a decent $*#@%# database, it would take
> care of that for us).

Case folding is not the only problem. The problem is that the fulltext index
does not index certain characters above 128. That means that words containing
multibyte characters encoded using such bytes will not be indexed. That's a
bit harder to explain to the users than the previous problems.

> > It's not so much PHP that is the problem, as it is MySQL.
>
> The particular problem I was discussing was the regexps, which were a
> PHP problem. MySQL could do all the magic it wanted; if the words don't
> get through the regexps in the PHP code they'll never get anywhere in
> the database's fulltext search.

I know, but what I am saying is that this is something we can solve with PHP
programming. But if MySQL doesn't index it, then we can let through all we
want, but it won't turn up in the search results. Period.

> > Perhaps we should consider moving to PostgreSQL, which really supports
> > UTF-8 and is a better database anyway (some special pages could be
> > implemented far more efficiently there).
>
> Oh, it's not like that hasn't been suggested. If anybody knows how to go
> about switching to Postgres, I sure as heck wouldn't object.

Unfortunately, I have only very limited hands-on experience with PostgreSQL
and no experience with its fulltext indexing. From the manuals I gather it
does have such a thing, but it takes some configuring. On the upside, it
seems far more configurable: you can in fact change the parser that decides
which words to index, how to collate, et cetera. It was written by
Russians, I believe, who are generally more aware of character set problems
anyway.

But the problem is that I don't have the time to research this; next month
I have a conference, and I have two deadlines for articles coming up.

-- Jan Hidders
Re: [Intlwiki-l] Re: php-wiki de
Here's a brief history of why we use MySQL:

Bomis' first database experience was with the software that runs
Slashdot. At the time, it required MySQL (other databases are
supported now), so we installed it. Our next database project
(Nupedia) came shortly thereafter, and we had no reason to learn a
whole different database setup. So, when Wikipedia came along and we
decided to make it database-driven, it just didn't make sense to
change databases, especially since it would have meant converting
Nupedia to the new format.

Now, the programmer who wrote Nupedia (for the most part) is no longer
with Bomis. I'm not sure what MySQL-specific features, if any, Nupedia
uses (or Wikipedia, for that matter). I am not opposed to installing a
different database, but I can't really spend a lot of time on converting
the software. Of course, it's not really my decision anyway...

Jason Richey

Brion L. VIBBER wrote:

> Oh, it's not like that hasn't been suggested. If anybody knows how to go
> about switching to Postgres, I sure as heck wouldn't object. I have no
> emotional attachment to MySQL; as far as I know it's only being used
> because Magnus was already familiar with it.

--
"Jason C. Richey" <jasonr@bomis.com>
Re: [Intlwiki-l] Re: php-wiki de
Jan.Hidders wrote:
> It's not so much PHP that is the problem, as it is MySQL. Perhaps we should
> consider moving to PostgreSQL, which really supports UTF-8 and is a better
> database anyway (some special pages could be implemented far more
> efficiently there).

A good opportunity for switching from MySQL to PostgreSQL is coming
soon, with the move from our current server to a new and much more
powerful server. I'd be happy to install PostgreSQL over there
instead. And we'll be giving out developer accounts to several
people, including root access where appropriate, so things are going
to be a lot easier for the development team.

I propose that this would also be a good time to switch to Lee's refactored
codebase, unless the consensus is that his code is not better than the
existing codebase.

--Jimbo
RE: [Intlwiki-l] Re: php-wiki de
> -----Original Message-----
> From: wikitech-l-admin@nupedia.com
> [mailto:wikitech-l-admin@nupedia.com]On Behalf Of Jason Richey
> Sent: Thursday, May 23, 2002 6:58 PM
> To: Brion L. VIBBER
> Cc: Jan.Hidders; Wikitech-L
> Subject: Re: [Intlwiki-l] Re: [Wikitech-l] php-wiki de
>
> Now, the programmer who wrote Nupedia (for the most part) is no longer
> with Bomis. I'm not sure what MySQL-specific features, if any, Nupedia
> uses (or Wikipedia, for that matter). I am not opposed to installing a
> different database, but I can't really spend a lot of time on converting
> the software. Of course, it's not really my decision anyway...

The current Nupedia software seems pretty unusable at the moment (the
copyeditor pages, among others). Personally, I doubt it will see the rebirth
of Nupedia. As I keep advertising ;), I'm rewriting the whole thing
(http://nunupedia.sourceforge.net). I doubt any MySQL-specific queries will
be necessary in that project, so there's no need to keep MySQL just for
Nupedia's sake.

As for Wikipedia, except for the search function, I don't think there's much
(if any) MySQL-specific stuff in there. It shouldn't be too hard to change to
something else. Maybe we could use Lee's test server for trying?

Magnus
Re: [Intlwiki-l] Re: php-wiki de
On ĵaŭ, 2002-05-23 at 04:52, Jan.Hidders wrote:
> On Thu, May 23, 2002 at 03:42:59AM -0700, Brion L. VIBBER wrote:
> > Fulltext search is already broken in a million ways! It doesn't know
> > character references (&uuml;, &#x109;, &#303; etc), it can't find
> > partial matches or sounds-likes, it can't find "X" when you search for
> > "Xs" or "Xs" when you search for "X", it doesn't return *ANY* results
> > for words it thinks are too common...
>
> All true, although the last problem will be solved when we move to the new
> MySQL and use the boolean search there. But at least the semantics were
> well-defined: if a word shows up in the edit text, the search will find it.
> That is now no longer the case, and no clever PHP programming can solve it.

What's not well-defined about "We're having some problems with the
search engine right now; if you don't find what you're looking for at
first, try capitalizing it and search again." ?

> > UTF-8 is the least of our problems; it just means that case-folding is a
> > little trickier (and if we had a decent $*#@%# database, it would take
> > care of that for us).
>
> Case folding is not the only problem. The problem is that the fulltext index
> does not index certain characters above 128. That means that words containing
> multibyte characters encoded using such bytes will not be indexed. That's a
> bit harder to explain to the users than the previous problems.

If it is in fact having problems indexing chars over 128, install the
attached hacked character set definition file and reindex the database.

my.cnf -> /etc/ (or wherever)
Index, custom.conf -> /usr/share/mysql/charsets/ (or wherever)
Run 'myisamchk -r -q' over the tables.

Works great on my machine...

-- brion vibber (brion @ pobox.com)
Re: [Intlwiki-l] Re: php-wiki de
On Thu, May 23, 2002 at 10:59:13AM -0700, Brion L. VIBBER wrote:
>
> If it is in fact having problems indexing chars over 128, install the
> attached hacked character set definition file and reindex the database.

Wow! A customized character set! I didn't know that was possible in MySQL.
That indeed looks like a solution. I bow before you, oh master.

-- Jan Hidders

PS. I'd still favour a move to PostgreSQL though, which would probably be a
cleaner solution and also has some other benefits. But before we do that I
would like to be very, very sure that fulltext indexing (which is now, I
believe, the only MySQL-specific thing we use) works well there.
Re: [Intlwiki-l] Re: php-wiki de
On Thu, May 23, 2002 at 09:33:21PM +0200, Jan.Hidders wrote:
>
> PS. I'd still favour a move to PostgreSQL though, which would probably be a
> cleaner solution and also has some other benefits. But before we do that I
> would like to be very, very sure that fulltext indexing (which is now, I
> believe, the only MySQL-specific thing we use) works well there.

An interesting alternative might be to do the indexing ourselves. This may
sound very inefficient, but let me explain. One method/trick that was used in
PgSQL to get a fulltext index was to create something like a separate table

text_index( search_word, document_id )

with a conventional index on the search_word column. Now, every time a row
is added to the table that you want to be searchable, you retrieve all the
search words (and their suffixes) and add the corresponding pairs to the
text_index table. So if the text of the document with id '5' is
"Text Indexing", you add the following pairs to text_index:

( "text", 5 )
( "ext", 5 )
( "xt", 5 )
( "indexing", 5 )
( "ndexing", 5 )
...

If you now want to search for an article containing "dex", you can query the
text_index table with something like this in your WHERE clause:

search_word LIKE 'dex%'

In this case the index on the search_word column is used, because the LIKE
expression doesn't start with a wildcard. As you can see, this also gives
you the partial matches. In PgSQL this is easy to implement because you can
define triggers that call certain functions whenever a row is inserted into
or deleted from a table. We could also do this in MySQL, although we would
then have to ensure ourselves that the text_index table is updated every
time an article is created, updated, or deleted.
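
In PHP the update hook could look something like this (a first sketch; the
table name is from the description above, everything else is illustrative):

<?php
// Sketch: rebuild the suffix index for one document. Every word in
// the text contributes (suffix, document_id) rows, so a search for
// "dex" can use  WHERE search_word LIKE 'dex%'  and hit the index.
function update_text_index( $doc_id, $text ) {
    $doc_id = intval( $doc_id );
    mysql_query( "DELETE FROM text_index WHERE document_id = $doc_id" );
    preg_match_all( '/[\w\x80-\xff]+/', $text, $m );
    foreach ( array_unique( $m[0] ) as $word ) {
        $word = strtolower( $word );  // the collation hook goes here
        for ( $i = 0; $i < strlen( $word ); $i++ ) {
            $suffix = addslashes( substr( $word, $i ) );
            mysql_query( "INSERT INTO text_index (search_word, document_id)"
                       . " VALUES ('$suffix', $doc_id)" );
        }
    }
}
?>

One would probably want to skip one- and two-character suffixes and batch
the INSERTs, but those are tuning details.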

The nice part is that we would have total control over what is indexed and
how it is indexed, in the form of a PHP function that determines which pairs
are added to text_index. This function could take care of collation and
could even do nifty things like taking entity references into account, or
letting a-umlaut match ae. The different Wikipedias would probably each have
a localized version of this function.

The downside is, of course, some performance cost when articles are updated,
plus the extra work of writing this PHP function (although an initial version
could be quite simple).

-- Jan Hidders