Mailing List Archive

Re: [Wikipedia-l] Re: Searching with common words
On Fri, Aug 16, 2002 at 10:53:17PM +0200, Axel Boldt wrote:
> Andre writes:
>
> >Could the search feature be changed such that common words, rather
> >than blocking the whole search, are removed from it
>
> I would like that very much. I just searched for "leap second" and
> found nothing, because "second" is a stop word...
>
> The internal mysql search engine already does the right thing and
> omits stop words silently, but we are using it separately for every
> word, which causes the problem.

The problem with that was that MySQL's scoring was based more on OR'ing the
search words. So a search for "world war" would give you a result that is
similar to what you now get if you do "world OR war".
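
To make the difference concrete, here is a toy sketch of OR versus AND matching. It is not how MySQL scores results; the helper names and the three sample documents are made up for illustration.

```python
import re

def words_of(text):
    """Lowercased set of \\w+ tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))

def match_any(query, docs):
    """OR semantics: a document matches if it contains at least one query word."""
    q = words_of(query)
    return [title for title, text in docs.items() if q & words_of(text)]

def match_all(query, docs):
    """AND semantics: a document matches only if it contains every query word."""
    q = words_of(query)
    return [title for title, text in docs.items() if q <= words_of(text)]

# Hypothetical mini-corpus, keyed by article title.
docs = {
    "A": "world war two",
    "B": "world cup",
    "C": "war and peace",
}

print(match_any("world war", docs))  # OR: every document mentions one of the words
print(match_all("world war", docs))  # AND: only documents with both words
```

With OR semantics all three sample documents come back for "world war"; with AND only the first one does.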

It surprises me a bit that "second" is a stopword, btw. That reminds me. As
far as I could tell there is only an English stopword list in MySQL. So
what do we do for the non-English Wikipedias? If we want to give them each
their own stopword list then we need to recompile MySQL for each of them and
give them all their own MySQL server. Or are we going to have one server
with an empty stopword list and check that this doesn't make the fulltext
index explode in size?

Another thing is that if a word appears in more than 50% of the documents
then its search result will also be empty. We cannot filter those out in the
search text because we don't know which words these are.

Perhaps it's time to think about rolling our own fulltext indexing
mechanism?

-- Jan Hidders
Re: Re: [Wikipedia-l] Re: Searching with common words
> Perhaps it's time to think about rolling our own fulltext
> indexing mechanism?

Short of that, there are still a few tricks we can do. First,
it was easier than I thought it would be to add MySQL's stoplist
to the code, so that's done now. Conversely, if you or others
want to customize the list itself, you can edit the file
"FulltextStoplist.php" in the code, and I can recompile MySQL
to use our customized one.
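
For anyone who wants to see the idea outside the PHP code, here is a minimal Python sketch of dropping stop words from a query before it is searched word by word. The function name and the abbreviated stop list are made up for illustration; the real list would be MySQL's own or whatever ends up in "FulltextStoplist.php".

```python
import re

# Hypothetical, heavily abbreviated stop list, for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "second", "the", "to"}

def strip_stop_words(query, stop_words=STOP_WORDS):
    """Return the query's words, lowercased, with stop words removed."""
    words = re.findall(r"\w+", query.lower())
    return [w for w in words if w not in stop_words]

# "second" is dropped, so the search still runs on "leap"
# instead of coming back empty.
print(strip_stop_words("leap second"))
```

If every word in the query is a stop word the list comes back empty, and the caller would have to decide what to tell the user in that case.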

Secondly, MySQL 4.0 has a much fancier fulltext search feature,
so we may want to switch over at some point when 4.0 is "blessed".
It's already a pretty stable product from what I hear, it's just
not "officially" stable.
Re: Re: [Wikipedia-l] Re: Searching with common words
On Fri, Aug 16, 2002 at 06:08:11PM -0700, lcrocker@nupedia.com wrote:
>
> [...] Conversely, if you or others
> want to customize the list itself, you can edit the file
> "FulltextStoplist.php" in the code, and I can recompile MySQL
> to use our customized one.

Actually, out of curiosity I wrote a small PHP script that fills a table
with information about which words are used how many times and in how many
articles. It needs some refining (it runs over all pages in table cur and it
uses the regexp \w+ to find words) but with it we could for example determine a
list of words that are used in more than 50% of the pages. Those are the
search words that MySQL ignores anyway. It would also give us a quick and
dirty way to determine the stopword list for the non-English Wikipedias.
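
The counting idea is roughly the following, shown here as a Python sketch rather than the PHP script itself. It assumes the page texts are already available as a list of strings instead of being read from table cur, and the function names are invented for the example.

```python
import re
from collections import Counter

def word_stats(pages):
    """Count, for each word, total occurrences and the number of pages using it."""
    total = Counter()     # word -> occurrences over all pages
    doc_freq = Counter()  # word -> number of pages containing the word
    for text in pages:
        words = re.findall(r"\w+", text.lower())
        total.update(words)
        doc_freq.update(set(words))  # count each page at most once per word
    return total, doc_freq

def too_common(pages, threshold=0.5):
    """Words that appear in more than `threshold` of all pages, sorted."""
    _, doc_freq = word_stats(pages)
    limit = threshold * len(pages)
    return sorted(w for w, n in doc_freq.items() if n > limit)

# Hypothetical sample pages, for illustration only.
pages = [
    "the leap second",
    "the world war",
    "a war of the worlds",
    "leap year",
]
print(too_common(pages))
```

On a real dump the result of too_common() would be the candidate stopword list for that language.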

I'm running it at the moment on the dump from May 20.

-- Jan Hidders