Mailing List Archive

update: boolean search is in CVS
Dear fellow programmers,

Boolean search has been implemented and committed to CVS. You can now use
the keywords "and", "or" and "not" in your queries, as in "Harry and not
Potter". The "and" operator is implicit, so you can also omit it, as in
"Harry not Potter". Likewise, the query "group theory" is interpreted as
"group AND theory". "not" has the highest priority, then "and", and finally
"or". You can use brackets to alter this, for example "Harry and ( Potter
or Sally )".
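
To make the precedence rules concrete, here is a minimal sketch of a
recursive-descent parser for such queries. It is written in Python for
illustration only and is not the code that went into CVS; the names
tokenize and parse and the tuple-based tree are assumptions of this sketch.

```python
import re


def tokenize(query):
    # Brackets are tokens of their own; everything else splits on whitespace.
    return re.findall(r"[()]|[^\s()]+", query)


def parse(query):
    """Parse a boolean query into nested tuples (hypothetical tree format)."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        # Keywords are matched case-insensitively in this sketch.
        return tokens[pos].lower() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        # Lowest priority: a or b or c
        node = parse_and()
        while peek() == "or":
            advance()
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        # Middle priority; the "and" keyword itself is optional.
        node = parse_not()
        while peek() not in (None, "or", ")"):
            if peek() == "and":
                advance()
            node = ("AND", node, parse_not())
        return node

    def parse_not():
        # Highest priority: "not" binds to the term right after it.
        if peek() == "not":
            advance()
            return ("NOT", parse_not())
        return parse_atom()

    def parse_atom():
        if peek() is None:
            raise SyntaxError("unexpected end of query")
        tok = advance()
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise SyntaxError("missing closing bracket")
            advance()
            return node
        if tok == ")" or tok.lower() in ("and", "or"):
            raise SyntaxError("unexpected " + tok)
        return tok

    tree = parse_or()
    if pos != len(tokens):
        raise SyntaxError("unexpected " + tokens[pos])
    return tree
```

With this sketch, parse("Harry not Potter") and parse("Harry and not
Potter") build the same tree, and the brackets in "Harry and ( Potter or
Sally )" put the "or" underneath the "and".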

A few problems remain:
1. The index only contains words with more than three characters, so searches
for shorter words always give an empty result. This means that a query like
"World war", which is interpreted as "World AND war", always gives an empty
result. I've made this a syntax error to warn the innocent user. The
minimum index word size can be changed to 2, but we would have to
recompile MySQL for that and rebuild the indexes.
2. The index has a fixed list of stopwords, and searching for these also
always gives an empty result. This list can also be changed in the source
code. The MySQL matching algorithm also takes into account that certain
words are more significant than others. While this is more a feature than a
bug, it also means that searches for very common words actually return very
few pages, or even none at all. This behavior can also be changed by setting
some constants in the source and recompiling.
3. If you have a complex boolean query, the search box seems a bit small. As
this is a layout matter, I leave this to Magnus to change. :-)
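
As a rough illustration of problems 1 and 2, the following Python sketch
lists the words in a query that the index could never match. It is not the
check that was committed: MIN_WORD_LEN is read off "more than three
characters", and the STOPWORDS set is a made-up sample, not MySQL's actual
list.

```python
import re

MIN_WORD_LEN = 4  # assumption: only words of four or more characters are indexed
STOPWORDS = {"about", "from", "have", "that", "this", "with"}  # illustrative sample only
OPERATORS = {"and", "or", "not"}


def unsearchable_words(query):
    """Return the query words that are too short or are stopwords."""
    bad = []
    for word in re.findall(r"[^\s()]+", query):
        lowered = word.lower()
        if lowered in OPERATORS:
            continue  # boolean keywords are not search terms
        if len(lowered) < MIN_WORD_LEN or lowered in STOPWORDS:
            bad.append(word)
    return bad
```

For "World war" this reports "war", which is the kind of word a front end
could warn about before sending the query to MySQL.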

Finally, I noticed that the search performs better if the title contains no
"_"s, because the index considers "Larry_Sanger" to be one word that does not
match very well with either "Larry" or "Sanger". To remedy this I have
changed the code a bit so that new titles are always stored in the indexed
column without underscores. However, for the old pages you still need to
update this by hand. I have included the statement in 'updSchema.sql'.
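
The effect of the underscores can be sketched in a few lines of Python. This
is an illustration only: it assumes the index treats "_" as a word character,
much as the regex class \w does, and the helper names index_title and
index_words are hypothetical.

```python
import re


def index_title(title):
    # Store the title with spaces so that each part is indexed separately.
    return title.replace("_", " ")


def index_words(text):
    # Stand-in for the indexer's word splitting; \w includes the underscore.
    return re.findall(r"\w+", text)
```

Here index_words("Larry_Sanger") yields the single word "Larry_Sanger",
while index_words(index_title("Larry_Sanger")) yields "Larry" and "Sanger".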

Enjoy,

-- Jan Hidders
Re: update: boolean search is in CVS
Wow, thanks Jan!

Larry

On Sun, 17 Feb 2002, Jan Hidders wrote:

> Boolean search has been implemented and committed to CVS.
> [...]