Mailing List Archive: Search now working on Unicode Wikipedias

I hacked up the fulltext search/index code a bit to work on UTF-8
despite MySQL's lack of direct support: a Language::stripForSearch()
function is called to do any necessary mangling of character sets before
we store the indexable version of the text.

For Esperanto, Polish, Russian, Czech and Korean I set it to just fold
the text to lowercase (so search is case insensitive) and then convert
all UTF-8 sequences into hex strings which MySQL won't mistreat.
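As a rough illustration of the lowercase-then-hex idea, here is a minimal
sketch in Python (not the wiki's own code). The function name, the "u8"
prefix and the choice to encode each run of non-ASCII characters as one
hex string are assumptions made for this example; the real
Language::stripForSearch() may differ in all of these details.

```python
import re

# Any run of characters outside the 7-bit ASCII range.
_NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def fold_and_hex(text: str) -> str:
    """Lowercase the text, then replace non-ASCII runs with ASCII hex.

    Hypothetical helper: the "u8" prefix is an assumed marker, not
    something taken from the original code.
    """
    lowered = text.lower()

    def to_hex(match: re.Match) -> str:
        # Encode the run as UTF-8 bytes and spell those bytes out in hex.
        return "u8" + match.group(0).encode("utf-8").hex()

    return _NON_ASCII_RUN.sub(to_hex, lowered)


if __name__ == "__main__":
    print(fold_and_hex("Ĉu"))
```

Because the same folding would be applied to both the indexed text and
the search term, upper- and lowercase spellings end up as the same
ASCII-only token.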

For Chinese and Japanese, things are a bit more complicated, as there is
no word spacing in the original text but the fulltext search works on
words. For Chinese I just set it to put spaces around every character;
it needs a lot of tweaking, but it sort of works. If you search for a
single character it works great, but multi-character sequences don't
behave as expected.
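The "spaces around every character" step could look roughly like the
Python sketch below. The function name and the character range (only the
main CJK Unified Ideographs block) are assumptions for illustration, not
what the actual code does.

```python
import re

# Assumed range: the basic CJK Unified Ideographs block only.
_CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def space_out_cjk(text: str) -> str:
    """Surround every CJK ideograph with spaces so each one is a 'word'."""
    spaced = _CJK_CHAR.sub(lambda m: f" {m.group(0)} ", text)
    # Collapse the doubled spaces that appear between adjacent ideographs.
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(space_out_cjk("维基百科"))
```

With this approach every ideograph is indexed as its own word, so a
multi-character query is treated as several separate one-character
words rather than as one unit.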

For Japanese, I have it divide up the text at boundaries around chunks
of the same type of character (hiragana, katakana, or kanji), which
gives a pretty good first approximation of dividing at the right place.
It could probably use some more work as well. When searching for a word
or short phrase that divides across character types (e.g., 'furansugo',
which mixes katakana and kanji), results may not be as expected.
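A minimal Python sketch of splitting at script boundaries follows. The
function names and the Unicode ranges used to classify characters are
assumptions for this example; the real code may classify characters
differently.

```python
from itertools import groupby


def script_of(ch: str) -> str:
    """Classify one character by script (assumed, simplified ranges)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_script(text: str) -> list[str]:
    """Group consecutive characters of the same script into chunks."""
    return ["".join(group) for _, group in groupby(text, key=script_of)]


if __name__ == "__main__":
    # 'furansugo wo hanasu' ("speak French")
    print(split_by_script("フランス語を話す"))
```

The example also shows the weak spot: 'furansugo' is written as katakana
followed by one kanji, so it ends up split into two chunks.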

-- brion vibber (brion @ pobox.com)

On Saturday 23 November 2002 07:22, Brion Vibber wrote:
> For Japanese, I have it divide up the text at boundaries around chunks
> of the same type of character (hiragana, katakana, or kanji), which
> gives a pretty good first approximation of dividing at the right place.
> It could probably use some more work as well. When searching for a word
> or short phrase that divides across character types (e.g., 'furansugo',
> which mixes katakana and kanji), results may not be as expected.

AFAIK (which isn't much, I know hardly any Japanese), Japanese desinences
(inflectional endings) are written in hiragana and the rest of the word is
written in kanji (for words of Chinese or Japanese origin) or katakana (for
words borrowed from anything else). So how about splitting wherever a
hiragana is followed by a katakana or kanji? But when a word written in
kanji is followed by another word written in kanji, neither algorithm will
know where to split it.
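The suggested rule (split only where a hiragana is followed by a katakana
or kanji) might be sketched in Python like this. The function names and
Unicode ranges are assumptions for illustration only.

```python
def is_hiragana(ch: str) -> bool:
    return 0x3040 <= ord(ch) <= 0x309F


def is_katakana_or_kanji(ch: str) -> bool:
    cp = ord(ch)
    # Assumed ranges: Katakana block and basic CJK Unified Ideographs.
    return 0x30A0 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def split_after_hiragana(text: str) -> list[str]:
    """Start a new chunk whenever hiragana is followed by katakana/kanji."""
    chunks: list[str] = []
    current = ""
    prev_was_hiragana = False
    for ch in text:
        if prev_was_hiragana and is_katakana_or_kanji(ch):
            chunks.append(current)
            current = ""
        current += ch
        prev_was_hiragana = is_hiragana(ch)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(split_after_hiragana("フランス語を話す"))
    print(split_after_hiragana("日本語学校"))
```

Here 'furansugo' stays in one chunk (together with the hiragana that
follows it), while a run of kanji such as the second example is left
unsplit, which is the limitation mentioned above.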

phma