Mailing List Archive

Finding matching search terms
I have the same question that Marc asked last month:
http://www.rectangular.com/pipermail/kinosearch/2007-November/001314.html

How do you find out which of the query terms resulted in a particular
hit? Also, would it be possible to find out which of the terms
resulted in no hits? I'd hate to have to perform a separate search on
each term and then merge the results.

Thanks.

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Finding matching search terms
On Sun, Dec 30, 2007 at 04:03:04PM -0800, colossus forbin wrote:

> How do you find out which of the query terms resulted in a particular hit?

This information is not preserved during the scoring process. The scoring
apparatus feeds doc_num/score pairs into a HitCollector object; that's all
it ever sees. It doesn't know which parts of a composite scorer
contributed to the score, nor how much each contributed.

The potential exists for us to sprinkle explain() methods throughout the
library, but these would be for debugging and demonstration rather than
production, as dealing with multi-tiered Explanation objects rather than
simple ScoreDocs would be expensive.

> Also, would it be possible to find out which of the terms resulted in no
> hits? I'd hate to have to perform a separate search on each term and then
> merge the results.

What is the final objective?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Finding matching search terms
On 12/30/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Sun, Dec 30, 2007 at 04:03:04PM -0800, colossus forbin wrote:
>
> > How do you find out which of the query terms resulted in a particular hit?
>
> This information is not preserved during the scoring process. The scoring
> apparatus feeds doc_num/score pairs into a HitCollector object; that's all
> it ever sees. It doesn't know which parts of a composite scorer
> contributed to the score, nor how much each contributed.
>
> The potential exists for us to sprinkle explain() methods throughout the
> library, but these would be for debugging and demonstration rather than
> production, as dealing with multi-tiered Explanation objects rather than
> simple ScoreDocs would be expensive.
>
> > Also, would it be possible to find out which of the terms resulted in no
> > hits? I'd hate to have to perform a separate search on each term and then
> > merge the results.
>
> What is the final objective?

I would like to determine if a user has misspelled a term and then
suggest a corrected version. If I were dealing with a single term or
were AND'ing the terms, it would be much simpler to handle.

Re: Finding matching search terms
On Sun, Dec 30, 2007 at 07:00:32PM -0800, colossus forbin wrote:
> I would like to determine if a user has misspelled a term and then
> suggest a corrected version. If I were dealing with a single term or
> were AND'ing the terms, it would be much simpler to handle.

The best algorithm for spellchecking -- the one used by Google and lots of
other people -- actually doesn't use search results as its primary source of
input. The way to handle this problem is to examine past search histories
and see how people corrected their queries. If "ciclops" is often followed
by "cyclops" and the query morphing stops there, then "cyclops" is probably
the right term and should be suggested.
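A minimal sketch of that log-mining idea, in Python for illustration (the session format, the `min_count` threshold, and the "morphing stops there" test are all assumptions of this sketch, not anything KinoSearch provides):

```python
from collections import Counter, defaultdict

def mine_corrections(query_log, min_count=2):
    """Find likely spelling corrections by counting how often one query
    immediately follows another within a session, keeping the most
    common follow-up whenever the morphing stops there (i.e. the
    follow-up is not itself commonly corrected again)."""
    followed_by = defaultdict(Counter)
    for session in query_log:                 # each session: queries in order
        for a, b in zip(session, session[1:]):
            if a != b:
                followed_by[a][b] += 1

    suggestions = {}
    for query, counter in followed_by.items():
        candidate, count = counter.most_common(1)[0]
        if count >= min_count and candidate not in followed_by:
            suggestions[query] = candidate
    return suggestions

log = [["ciclops", "cyclops"], ["ciclops", "cyclops"], ["kraken"]]
print(mine_corrections(log))                  # {'ciclops': 'cyclops'}
```

The resulting map could then be stored in an on-disk hash (Berkeley DB, as suggested below) and consulted once per query term at search time.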

This really should be implemented as a CPAN project entirely distinct from
KinoSearch. KS is an inverted indexer at its heart, but an inverted index
isn't what's needed; the magic is all in the preprocessing, and then it's a
dictionary lookup (probably via a hash a la Berkeley DB) for each term to
see if it is associated with any "suggestions".

Though I know of this algo by word of mouth, I'm sure there are many
academic papers out there by now which could provide a recipe. Then you
need a large corpus. Those are hard to come by because of privacy concerns
(think AOL fiasco), but the Pirate Bay search query log might serve:
<http://thepiratebay.org/tor/3783572>.

Users of this theoretical library might either use a dictionary derived from
the Pirate Bay logs or might create their own by turning the module loose on
their own query logs.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Finding matching search terms
On Dec 30, 2007 10:29 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Sun, Dec 30, 2007 at 07:00:32PM -0800, colossus forbin wrote:
> > I would like to determine if a user has misspelled a term and then
> > suggest a corrected version. If I were dealing with a single term or
> > were AND'ing the terms, it would be much simpler to handle.
>
> The best algorithm for spellchecking -- the one used by Google and lots of
> other people -- actually doesn't use search results as its primary source of
> input. The way to handle this problem is to examine past search histories
> and see how people corrected their queries. If "ciclops" is often followed
> by "cyclops" and the query morphing stops there, then "cyclops" is probably
> the right term and should be suggested.
>
> This really should be implemented as a CPAN project entirely distinct from
> KinoSearch. KS is an inverted indexer at its heart, but an inverted index
> isn't what's needed; the magic is all in the preprocessing, and then it's a
> dictionary lookup (probably via a hash a la Berkeley DB) for each term to
> see if it is associated with any "suggestions".
>
> Though I know of this algo by word of mouth, I'm sure there are many
> academic papers out there by now which could provide a recipe. Then you
> need a large corpus. Those are hard to come by because of privacy concerns
> (think AOL fiasco), but the Pirate Bay search query log might serve:
> <http://thepiratebay.org/tor/3783572>.
>
> Users of this theoretical library might either use a dictionary derived from
> the Pirate Bay logs or might create their own by turning the module loose on
> their own query logs.

This approach would make sense for a large site that expects a large
set of search terms, but what about a small site expecting a limited
number of terms, such as a small ecommerce site with a limited number
of products? If a user misspells a product name, it would make sense
not only to offer a corrected spelling, but perhaps also to suggest a
similar product carried by the site. These actions would be done at
run-time, so it would be important to know which terms did not
contribute to any hits.

Re: Finding matching search terms
On 12/31/2007 11:21 AM, colossus forbin wrote:

> This approach would make sense for a large site that expects a large
> set of search terms, but what about a small site expecting a limited
> number of terms, such as a small ecommerce site with a limited number
> of products? If a user misspells a product name, it would make sense
> to not only offer a corrected spelling, but perhaps suggest a similar
> product which is carried by the site. These actions would be done at
> run-time so it would be important to know which terms did not
> contribute to any hits.
>

fwiw, I have implemented site-specific lexicons using Search::Tools::SpellCheck. It
uses Text::Aspell underneath, and I just created a custom dictionary with the relevant
products, etc. Whether a given term in a query had any matches was ignored; what was
important was to offer suggestions of other queries to try.

--
Peter Karman . peter@peknet.com . http://peknet.com/


Re: Finding matching search terms
On Mon, Dec 31, 2007 at 09:21:24AM -0800, colossus forbin wrote:
> This approach would make sense for a large site that expects a large
> set of search terms, but what about a small site expecting a limited
> number of terms, such as a small ecommerce site with a limited number
> of products? If a user misspells a product name, it would make sense
> to not only offer a corrected spelling, but perhaps suggest a similar
> product which is carried by the site. These actions would be done at
> run-time so it would be important to know which terms did not
> contribute to any hits.

There's a method on Searcher, doc_freq(), which returns an integer telling you
how many documents a given term occurs in. Terms with a doc_freq of 0
have no chance of contributing to a score.

Searcher->doc_freq isn't public yet, but there's a good chance it will
be exposed in time -- document frequency information will always be
needed during the Query-to-Scorer compilation phase for weighting. I don't
think the API has changed since 0.05, so go ahead and use it, with caution. :)
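As a sketch of that check (in Python rather than Perl purely for illustration; `doc_freq` here is a hypothetical stand-in, not the real signature of KinoSearch's Searcher method):

```python
def zero_hit_terms(doc_freq, field, terms):
    """Return the terms whose document frequency is zero; such terms
    cannot contribute to any hit and are spellcheck candidates."""
    return [t for t in terms if doc_freq(field, t) == 0]

# toy stand-in for a searcher: term -> number of documents containing it
index = {"cyclops": 12, "kraken": 3}
freq = lambda field, term: index.get(term, 0)

print(zero_hit_terms(freq, "content", ["ciclops", "kraken"]))  # ['ciclops']
```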

Once you've identified your terms, you need to figure out what to suggest.
Aspell would help with ordinary words, but might not help with product names.
Maybe one of the edit-distance CPAN modules could help.
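For instance, Python's standard-library difflib plays the same role as those CPAN edit-distance modules (it ranks candidates by a similarity ratio rather than raw edit distance); a sketch against a small, made-up product vocabulary:

```python
import difflib

def suggest(term, vocabulary, cutoff=0.6):
    """Offer the closest known terms for a possibly misspelled one.
    cutoff is the minimum similarity ratio (0..1) a match must reach."""
    return difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff)

products = ["cyclops", "kraken", "leviathan"]
print(suggest("ciclops", products))   # ['cyclops']
```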

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Finding matching search terms
On Dec 31, 2007 1:57 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Mon, Dec 31, 2007 at 09:21:24AM -0800, colossus forbin wrote:
> > This approach would make sense for a large site that expects a large
> > set of search terms, but what about a small site expecting a limited
> > number of terms, such as a small ecommerce site with a limited number
> > of products? If a user misspells a product name, it would make sense
> > to not only offer a corrected spelling, but perhaps suggest a similar
> > product which is carried by the site. These actions would be done at
> > run-time so it would be important to know which terms did not
> > contribute to any hits.
>
> There's a method on Searcher, doc_freq(), which returns an integer telling you
> how many documents a given term occurs in. Terms with a doc_freq of 0
> have no chance of contributing to a score.
>
> Searcher->doc_freq isn't public yet, but there's a good chance it will
> be exposed in time -- document frequency information will always be
> needed during the Query-to-Scorer compilation phase for weighting. I don't
> think the API has changed since 0.05, so go ahead and use it, with caution. :)

Excellent! Thank you.

> Once you've identified your terms, you need to figure out what to suggest.
> Aspell would help with ordinary words, but might not help with product names.
> Maybe one of the edit-distance CPAN modules could help.

Aspell should work with product names, as it allows the use of custom
dictionaries (not just supplemental ones), and it makes suggestions
based on both phonetic similarity and edit distance.

I noticed that Xapian has support for misspelled terms, so maybe it
would make sense to have a recipe on the wiki on how to use KinoSearch
with Aspell.
