Mailing List Archive

update: new MySQL search index committed to CVS
L.S.,

I have just committed a rewritten version of the search engine to CVS. It is
now based on the fulltext indexes offered by MySQL. To use it, the database
schema has to be extended a little: the table types have changed to MyISAM
(MySQL's own version of ISAM), there is an extra redundant column, and two
fulltext indexes have to be added. The commands to do this have been added to
"updSchema.sql"; just uncomment the commands you still have to run. The
resulting database schema should match "wikipedia.sql".

This new search engine is something of a mixed blessing.

First the good news.
+ The main reason for introducing it is that the database reported the
search queries as its slowest, so they were eating up a lot of database
resources. So not only will it make searches faster, but the whole of
Wikipedia will probably benefit from its introduction.
+ The search engine tries to estimate the relevance of pages with respect to
the given search words and presents them to the user in that order, rather
than sorted alphabetically as they are now.
+ With the arrival of MySQL4 there will be new search possibilities such as
boolean searches and natural language searches.

And now for the bad news.
- I had to introduce an extra redundant column containing a duplicate of
cur_title. This is because a fulltext index cannot be defined on binary
columns.
- If you give the search engine multiple words, it also regards multiple
occurrences of just one of those words as very relevant. So the term
"Larry Sanger" will also lead you to pages that only contain a lot of
"Larry".
- It does not search on small words of three letters or fewer. So the term
"war" gives you zero results.
- It searches in the raw HTML, so it doesn't know that G&ouml;del and Gödel
are the same.
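
To make the second and third points concrete, here is a toy sketch in Python.
It is not MySQL's actual ranking algorithm; it assumes relevance is simply the
total number of occurrences of the query words, and that words of three
letters or fewer are never indexed. The names MIN_WORD_LEN, tokenize, score
and search, and the sample pages, are all made up for illustration.

```python
import re
from collections import Counter

# Assumption: words of three letters or fewer are not indexed.
MIN_WORD_LEN = 4


def tokenize(text):
    """Lowercase the text and keep only words long enough to be indexed."""
    found = re.findall(r"\w+", text.lower())
    return [w for w in found if len(w) >= MIN_WORD_LEN]


def score(query, page_text):
    """Toy relevance: total occurrences of the query words on the page."""
    counts = Counter(tokenize(page_text))
    return sum(counts[w] for w in tokenize(query))


# Hypothetical page texts, invented for this example.
pages = {
    "Larry Sanger": "Larry Sanger edits articles.",
    "Larry (disambiguation)": "Larry Larry Larry Larry: people named Larry.",
    "War": "A war is an armed conflict.",
}


def search(query):
    """Return the titles with a non-zero score, best match first."""
    scored = [(score(query, text), title) for title, text in pages.items()]
    return [title for s, title in sorted(scored, reverse=True) if s > 0]
```

Under these assumptions, search("Larry Sanger") puts the page that only
repeats "Larry" ahead of the page that contains both words, and search("war")
returns nothing because the query word is dropped before matching.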

That's it for now. My next task will be the MostWanted Page.

-- Jan Hidders
Re: update: new MySQL search index committed to CVS
One cute trick that I have often used is to calculate the
"uselessness" of a particular word. A word is semantically more
useless if it appears more often. This has really dramatic empirical
results for the better, especially on small datasets. (Maybe on
really big ones, too, but I've never played with those.)

Thus if someone searches for 'John Malkovich' they get a good result,
because 'John' is not weighted so heavily -- it's a more useless word
because it appears more often in the search set. But 'Malkovich', now
you're talking, there's a word that _means something_.
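
A minimal sketch of that idea in Python, assuming "uselessness" is measured by
how many documents contain a word and that the weight is log(N / document
count). That formula is one common choice (inverse document frequency), not
necessarily the one I described above; the function names and the sample
documents are invented for the example.

```python
import math
import re
from collections import Counter


def words(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def word_weights(documents):
    """Weight each word by log(N / number of documents containing it).

    A word that appears in every document gets weight 0 (fully 'useless').
    """
    n = len(documents)
    doc_freq = Counter()
    for text in documents.values():
        doc_freq.update(set(words(text)))
    return {w: math.log(n / df) for w, df in doc_freq.items()}


def rank(query, documents):
    """Return titles ordered by the summed weights of the query words found."""
    weights = word_weights(documents)
    results = []
    for title, text in documents.items():
        present = set(words(text))
        total = sum(weights[w] for w in set(words(query)) if w in present)
        if total > 0:
            results.append((total, title))
    return [title for total, title in sorted(results, reverse=True)]


# Hypothetical documents, invented for this example.
docs = {
    "malkovich": "John Malkovich is an actor",
    "smith": "John Smith is a name",
    "doe": "John Doe is a placeholder name",
    "other": "Being a placeholder",
}
```

With these documents, "john" appears in three of four texts and gets a small
weight, while "malkovich" appears in one and gets a large weight, so
rank("John Malkovich", docs) lists the "malkovich" document first.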