Mailing List Archive

Stopwords and page text search
Hi all,
Where are the search stop-words stored? I suspect that the word
"gay" may be set as a stop-word by default, because on my MediaWiki
setup searching for it gives no results, even though it appears on at
least one page, and searching for any other word on that page does
return results.

Thanks,
Matias
Re: Stopwords and page text search
Matias Pelenur wrote:
> Hi all,
> Where are the search stop-words stored? I suspect that the word
> "gay" may be set as a stop-word by default, because on my MediaWiki
> setup searching for it gives no results, even though it appears on at
> least one page, and searching for any other word on that page does
> return results.

MySQL has a hard-coded stopword list. (In 4.0 and up it can be
overridden through server-wide configuration.) MediaWiki ships a copy of
the default list (FulltextStoplist.php) so it can strip stopwords out of
multiple-word searches: if you search for "the united nations", it
searches only for "united" and "nations", rather than searching for
"the", getting no results, and so matching nothing for "united" or
"nations" either. I think this copy is only used in MySQL 3 mode, so if
you configured for MySQL 4 it isn't consulted, and what counts as a
stopword is up to the list actually in MySQL.
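
As a rough illustration of that stripping step, here is a Python sketch.
It is not MediaWiki's actual PHP code; the tiny stopword set and the
function name strip_stopwords are made up for the example.

```python
# Hypothetical, heavily shortened stopword set; the real list is much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "for"}

def strip_stopwords(query):
    """Return the query terms with stopwords removed, preserving order."""
    return [word for word in query.lower().split() if word not in STOPWORDS]

print(strip_stopwords("the united nations"))  # ['united', 'nations']
```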

Also, words appearing in over 50% of the search space will not match;
this particularly affects very small databases.
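
To see why small databases are hit hardest, a Python sketch of the idea
follows. The function name, the sample pages, and the exact boundary
(strictly over half, as worded above) are assumptions for illustration,
not MySQL's real implementation.

```python
def too_common(word, documents, threshold=0.5):
    """True if `word` occurs in more than `threshold` of the documents."""
    if not documents:
        return False
    hits = sum(1 for doc in documents if word in doc.lower().split())
    return hits / len(documents) > threshold

# With only three pages, a word on two of them is already "too common".
pages = ["main page welcome", "help page", "sandbox"]
print(too_common("page", pages))     # True  (2 of 3 pages)
print(too_common("sandbox", pages))  # False (1 of 3 pages)
```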

However, your problem is most likely the minimum word length limit; I
believe the default is four characters, so "gay" would not be found, nor
would "tea" or "gun" or "war" or "hat".
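
The effect of the length limit can be sketched in Python like this. The
names MIN_WORD_LEN and indexable are hypothetical, and the value 4 is the
default assumed above.

```python
MIN_WORD_LEN = 4  # assumed default minimum word length

def indexable(words, min_len=MIN_WORD_LEN):
    """Keep only the words long enough to be indexed and searched."""
    return [word for word in words if len(word) >= min_len]

print(indexable(["gay", "tea", "gun", "war", "hat", "united"]))  # ['united']
print(indexable(["gay", "tea"], min_len=3))                      # ['gay', 'tea']
```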

For the MySQL 3 mode we again trim out short words before passing them
to the search engine; you can override the limit by setting the variable
$wgDBminWordLen in LocalSettings.php. I don't think the check is done in
MySQL 4 mode, since that uses a more advanced mode of the MySQL engine
and works differently, but I'm not sure offhand. Either way, you may
still need to adjust MySQL itself; see:
http://dev.mysql.com/doc/mysql/en/Fulltext_Fine-tuning.html

-- brion vibber (brion @ pobox.com)
Re: Stopwords and page text search
Cool, I set the variable in LocalSettings, added a line
"ft_min_word_len=3" to my.cnf, repaired the searchindex table, and now
it works just fine!
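
For anyone wanting to double-check the my.cnf side, a small Python sketch
that reads the setting back. It assumes the line sits in the [mysqld]
section and is spelled with underscores; the function name and the sample
file contents are invented for the example.

```python
import configparser

def ft_min_word_len(cnf_text, default=4):
    """Return ft_min_word_len from the [mysqld] section, or the default."""
    # allow_no_value lets bare flags such as "skip-networking" parse cleanly.
    parser = configparser.ConfigParser(allow_no_value=True, strict=False)
    parser.read_string(cnf_text)
    return parser.getint("mysqld", "ft_min_word_len", fallback=default)

sample = """
[mysqld]
skip-networking
ft_min_word_len=3
"""
print(ft_min_word_len(sample))  # 3
```

The existing index still has to be rebuilt after the change, which is
what repairing the searchindex table took care of here.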

Thanks,
matias
