Mailing List Archive

A Line in LocalSettings.php
What's this all about?

$wgDBminWordLen = 3; # Match this to your MySQL fulltext

Fred
Re: A Line in LocalSettings.php
On Fri, 21 Feb 2003, Fred Bauder wrote:
> What's this all about?
>
> $wgDBminWordLen = 3; # Match this to your MySQL fulltext

MySQL's FULLTEXT indexing is used for the search function; the index by
default ignores words shorter than some number of letters (3 or 4?).

Our search function does a primitive boolean search by parsing the search
query into words (e.g., "dime a dozen" -> "dime", "a", "dozen") and doing
separate MATCH queries on each one using the index, then ANDing the
results together logically. So an article has to match all words to come
up in the results. But the "a" is too short (and anyway a stopword --
another issue in itself) and so is ignored by the index; its MATCH doesn't
match *any* articles, and the ANDed result comes up empty.

So, we have to know which words are going to be ignored by MySQL's
fulltext search so we can skip them. "dime a dozen" -> "dime" and "dozen".
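
A minimal sketch of that splitting-and-skipping step, in Python rather than
the wiki's own PHP. The names split_query, build_where and the body_text
column are hypothetical, and MIN_WORD_LEN is only assumed to play the role
of $wgDBminWordLen. Stopwords are not handled here.

```python
import re

# Assumed to mirror $wgDBminWordLen; the real value depends on how the
# MySQL server was built.
MIN_WORD_LEN = 3


def split_query(query, min_len=MIN_WORD_LEN):
    """Split a query into lowercase words, dropping those too short to index."""
    words = re.findall(r"\w+", query.lower())
    return [w for w in words if len(w) >= min_len]


def build_where(query, column="body_text", min_len=MIN_WORD_LEN):
    """Return (where_clause, params): one MATCH per kept word, ANDed together.

    `column` is a hypothetical column name; %s is a driver-side placeholder,
    so the search words are never pasted into the SQL text itself.
    """
    words = split_query(query, min_len)
    clause = " AND ".join(f"MATCH({column}) AGAINST (%s)" for _ in words)
    return clause, words


print(split_query("dime a dozen"))  # ['dime', 'dozen']
```

With the minimum length set to 3, "a" never reaches the database, so the
ANDed query can only fail on words the index actually knows about.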

This would go away using MySQL 4, which has a boolean search mode for its
fulltext searching, but since there are a lot of mysql3 installations out
there we'd have to keep this around as an option.

See the MySQL docs:
http://www.mysql.com/doc/en/Fulltext_Search.html

-- brion vibber (brion @ pobox.com)
Re: A Line in LocalSettings.php
Brion Vibber wrote:

> MySQL's FULLTEXT indexing is used for the search function; the index by
> default ignores words shorter than some number of letters (3 or 4?).
>
> Our search function does a primitive boolean search by parsing the search
> query into words (e.g., "dime a dozen" -> "dime", "a", "dozen") and doing
> separate MATCH queries on each one using the index, then ANDing the
> results together logically. So an article has to match all words to come

If you just let MySQL 3.x match the phrase "dime a dozen", it will
return any entries that contain either "dime" or "dozen" and give
higher ranks to those that contain both or lots of these words. It's
more of an "or" than an "and", but it works very well. Is it really
worth the hassle to go through the splitting and anding? Do all users
really want the strict "and"? When I use Google I don't really want
an empty hit list.


--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik - http://aronsson.se/
Re: A Line in LocalSettings.php
On Mon, 24 Feb 2003, Lars Aronsson wrote:
> If you just let MySQL 3.x match the phrase "dime a dozen", it will
> return any entries that contain either "dime" or "dozen" and give
> higher ranks to those that contain both or lots of these words.

Yup.

> It's
> more of an "or" than an "and", but it works very well. Is it really
> worth the hassle to go through the splitting and anding? Do all users
> really want the strict "and"?

All I know is I hear complaints when I change to the straight search
method as a performance hack and people find that "John Smith" turns up
mostly results like "John Wigglesworth" and "Michael Smith".

-- brion vibber (brion @ pobox.com)
Re: A Line in LocalSettings.php
> (Brion Vibber <vibber@aludra.usc.edu>):
> On Mon, 24 Feb 2003, Lars Aronsson wrote:
> > If you just let MySQL 3.x match the phrase "dime a dozen", it will
> > return any entries that contain either "dime" or "dozen" and give
> > higher ranks to those that contain both or lots of these words.
>
> Yup.
>
> > It's
> > more of an "or" than an "and", but it works very well. Is it really
> > worth the hassle to go through the splitting and anding? Do all users
> > really want the strict "and"?
>
> All I know is I hear complaints when I change to the straight search
> method as a performance hack and people find that "John Smith" turns up
> mostly results like "John Wigglesworth" and "Michael Smith".

Also, the min-length thing is independent of that choice anyway. By
default, MySQL won't index any word shorter than 4 letters, and we
got lots of complaints that you couldn't search for "PNG" or "XP".
MySQL has to be recompiled to change that limit, and the local setting
just informs the wiki software of how MySQL was compiled so it knows
what it can hand to the indexer.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: A Line in LocalSettings.php
On Mon, 24 Feb 2003, Lee Daniel Crocker wrote:
> Also, the min-length thing is independent of that choice anyway. By
> default, MySQL won't index any word shorter than 4 letters, and we
> got lots of complaints that you couldn't search for "PNG" or "XP".
> MySQL has to be recompiled to change that limit, and the local setting
> just informs the wiki software of how MySQL was compiled so it knows
> what it can hand to the indexer.

Well, yes and no. If we don't parse the query into separate words, it
doesn't matter how much we hand to it: a MATCH AGAINST( "Windows XP" )
will still turn up all the results for "Windows", whereas MATCH AGAINST(
"Windows" ) AND MATCH AGAINST ("XP") returns nothing at all -- that's why
we need to be aware of it and remove the too-short words from the search
using our hackish boolean system.
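
The difference can be reproduced with a toy in-memory index. This is only
a sketch: build_index, match_any and match_all are hypothetical helpers
that imitate the two query styles, not MySQL's actual ranking, and the
minimum word length of 3 is an assumption.

```python
MIN_WORD_LEN = 3  # assumed minimum length of an indexed word


def build_index(docs, min_len=MIN_WORD_LEN):
    """Map each sufficiently long word to the set of titles containing it."""
    index = {}
    for title, text in docs.items():
        for word in text.lower().split():
            if len(word) >= min_len:
                index.setdefault(word, set()).add(title)
    return index


def match_any(index, query):
    """Imitate one MATCH over the whole phrase: any indexed word may hit."""
    hits = set()
    for word in query.lower().split():
        hits |= index.get(word, set())
    return hits


def match_all(index, query):
    """Imitate one MATCH per word, ANDed: every word must hit."""
    result = None
    for word in query.lower().split():
        hits = index.get(word, set())
        result = hits if result is None else result & hits
    return result if result is not None else set()


docs = {
    "Windows XP": "windows xp is an operating system",
    "Windows 95": "windows 95 is an older operating system",
    "Linux": "linux is an operating system kernel",
}
index = build_index(docs)
print(sorted(match_any(index, "Windows XP")))  # ['Windows 95', 'Windows XP']
print(sorted(match_all(index, "Windows XP")))  # []
```

Because "xp" never makes it into the index, the ANDed form has nothing to
intersect with, which is why the too-short words have to be dropped first.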

However, it would be nice to spit out a little message to the effect of
"The word 'XP' has been ignored in your search because MySQL doesn't like
it," whatever the search method.
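
One way such a message could be produced, sketched in Python. The helper
names partition_query and ignored_notice, the default minimum length of 3,
and the exact wording of the notice are all assumptions for illustration.

```python
import re


def partition_query(query, min_len=3):
    """Split a query into (kept, ignored) words by minimum indexed length."""
    words = re.findall(r"\w+", query)
    kept = [w for w in words if len(w) >= min_len]
    ignored = [w for w in words if len(w) < min_len]
    return kept, ignored


def ignored_notice(ignored):
    """Build a user-facing notice for the ignored words, or '' if none."""
    if not ignored:
        return ""
    quoted = ", ".join(f"'{w}'" for w in ignored)
    if len(ignored) == 1:
        return (f"The word {quoted} has been ignored in your search "
                "because it is too short to be indexed.")
    return (f"The words {quoted} have been ignored in your search "
            "because they are too short to be indexed.")


kept, ignored = partition_query("Windows XP")
print(kept, ignored)  # ['Windows'] ['XP']
print(ignored_notice(ignored))
```

Keeping the ignored words in a separate list means the same notice can be
shown whichever search method ends up running the kept words.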

-- brion vibber (brion @ pobox.com)
Re: A Line in LocalSettings.php
> (Brion Vibber <vibber@aludra.usc.edu>):
>
> Well, yes and no. If we don't parse the query into separate words, it
> doesn't matter how much we hand to it: a MATCH AGAINST( "Windows XP" )
> will still turn up all the results for "Windows", whereas MATCH AGAINST(
> "Windows" ) AND MATCH AGAINST ("XP") returns nothing at all -- that's why
> we need to be aware of it and remove the too-short words from the search
> using our hackish boolean system.
>
> However, it would be nice to spit out a little message to the effect of
> "The word 'XP' has been ignored in your search because MySQL doesn't like
> it," whatever the search method.

Yeah, the issues are intertwined a bit. BTW, I compiled MySQL on the
wikipedia server with a minimum word length of 2, so "XP" works fine.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: A Line in LocalSettings.php
On Mon, 24 Feb 2003, Lee Daniel Crocker wrote:
> Yeah, the issues are intertwined a bit. BTW, I compiled MySQL on the
> wikipedia server with a minimum word length of 2, so "XP" works fine.

Still doesn't help with "vitamin E" or "C sharp"...

Any reason we can't take it down to 1?

-- brion vibber (brion @ pobox.com)
Re: A Line in LocalSettings.php
> (Brion Vibber <vibber@aludra.usc.edu>):
> On Mon, 24 Feb 2003, Lee Daniel Crocker wrote:
> > Yeah, the issues are intertwined a bit. BTW, I compiled MySQL on the
> > wikipedia server with a minimum word length of 2, so "XP" works fine.
>
> Still doesn't help with "vitamin E" or "C sharp"...
>
> Any reason we can't take it down to 1?

Two problems I can see: first, "A" and "I" clearly can't have
meaningful indexes, so even if we make "Vitamin C" searchable, it
won't work for "Vitamin A", causing confusion. Also, single
letters often appear in articles as links (see list of diseases,
for example) or as outline labels, etc., and would further make
single-letter searches less meaningful.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: A Line in LocalSettings.php
On Mon, 24 Feb 2003, Lee Daniel Crocker wrote:
> > (Brion Vibber <vibber@aludra.usc.edu>):
> > Still doesn't help with "vitamin E" or "C sharp"...
> >
> > Any reason we can't take it down to 1?
>
> Two problems I can see: first, "A" and "I" clearly can't have
> meaningful indexes, so even if we make "Vitamin C" searchable, it
> won't work for "Vitamin A", causing confusion.

24/26ths less confusion than none of them working, I'd wager.

> Also, single letters often appear in articles as links (see list of
> diseases, for example) or as outline labels, etc., and would further
> make single-letter searches less meaningful.

In most cases they'd be meaningful _enough_ though, for two reasons:

a) alphabetical lists, "I", and "a" are very rare in article titles, which
we search separately from body text. Yes, there's going to be the
occasional spurious "Biographical index -- C through G", but does that
negate the utility of returning "C programming language" (and a few
other pages) instead of nothing to the hapless kid searching for that hip
programming language?

b) in most cases such searches will be in conjunction with other words.
"Vitamin C" or "Malcolm X" will appear more often in conjunction when they
are in fact mentioned together than in random unrelated lists. There will
be some false positives, but that's better than many many false negatives.

-- brion vibber (brion @ pobox.com)