Mailing List Archive

Typos in queries
Hello,

I was wondering if there is already a method to return results when the
query contains typos.

For exemple the user searches for "nokopol" and we have a document that
has "nikopol" in the title and in the body, I'd like to be able to return
it.

I'd like to match for typos only on titles and eventually a few other
fields (for performance reason I don't think matching typos in body text
is appropriate).

Google has such a feature and I think they are doing it with their
statistical data (user usually correct typos and the recognize this).

However as I don't have statistical data and I'd like to provide this ootb
I was thinking I might go with the Levenshtein distance
(http://en.wikipedia.org/wiki/Levenshtein_distance) and fortunatly there
is a perl module that does this (thanks cpan).

So my question is, is there already something I missed included in
KinoSearch (stems are quite different as far as I can tell) that does that
?

And if not, does including that Levenshtein distance calculation to return
"corrected" result set make sense at first sight ?

Thanks

Eriam








_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Typos in queries [ In reply to ]
On Thu, May 22, 2008 at 12:12 AM, Eriam Schaffter
<eriam@mediavirtuel.com> wrote:
> I was wondering if there is already a method to return results when the
> query contains typos.

So far as I know, there is not yet such a method.

Because inverted indexes are keyed by the word you are searching for,
using the Levenshtein distance directly is difficult. You would have
to calculate the distance from each word in the query to each word in
your index. This would only work if your index has a limited
vocabulary, you have a lot of extra processing power, or you come up
with an elegant means of caching your results.

I would recommend an easier solution: use an existing spell checker
like Text::Aspell to 'suggest' spelling corrections, and then
construct a query with these suggestions as part of a weighted Boolean
query. If I recall correctly, you have to hack libaspell if you want
it to return the distance, but otherwise this should work well.

Good luck!

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Typos in queries [ In reply to ]
On Thu, May 22, 2008 at 12:44 AM, Nathan Kurz <nate@verse.com> wrote:
> If I recall correctly, you have to hack libaspell if you want
> it to return the distance, but otherwise this should work well.

I found my hacked up version of libaspell. The file you have to
change is: aspell-0.60.5/modules/speller/default/suggest.cpp

Here are the changes I made, although I'd suggest that you make your
own patch as I recall doing this just as an undebugged proof of
concept:

--- suggest.cpp.orig 2007-06-05 18:12:58.000000000 -0600
+++ suggest.cpp 2007-06-05 22:50:49.000000000 -0600
@@ -1219,6 +1227,9 @@
COUT << i->word << '\t' << i->score
<< '\t' << i->soundslike << "\n";
# endif
+ char score_text[16];
+ sprintf(score_text, ":%d", i->score);
+
if (i->repl_list != 0) {
String::size_type pos;
do {
@@ -1228,13 +1239,13 @@
? (bool)sp->check(*dup_pair.first)
: (sp->check((String)dup_pair.first->substr(0,pos))
&& sp->check((String)dup_pair.first->substr(pos+1))) ))
- near_misses_final->push_back(*dup_pair.first);
+ near_misses_final->push_back((String)dup_pair.first->append(score_te
xt));
} while (i->repl_list->adv());
} else {
fix_case(i->word);
dup_pair = duplicates_check.insert(i->word);
if (dup_pair.second )
- near_misses_final->push_back(*dup_pair.first);
+ near_misses_final->push_back((String)dup_pair.first->append(score_text
));
}
}
}

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Typos in queries [ In reply to ]
On 05/22/2008 01:44 AM, Nathan Kurz wrote:

> I would recommend an easier solution: use an existing spell checker
> like Text::Aspell to 'suggest' spelling corrections, and then
> construct a query with these suggestions as part of a weighted Boolean
> query. If I recall correctly, you have to hack libaspell if you want
> it to return the distance, but otherwise this should work well.

Look too at Search::Tools::Spellcheck which wraps Text::Aspell especially for search apps.
IIRC, libaspell allows you too also build custom dictionaries if your corpus requires it.

--
Peter Karman . peter@peknet.com . http://peknet.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch