Mailing List Archive

Stemming and Term/TermQuery
Hi,

It's my 2nd day looking into KinoSearch, so I'm sorry if I'm missing
something obvious, but the behavior I'm seeing is certainly weird.

For debugging purposes, I've minimized the index to just one field
('title'), and I'm doing a series of test queries against the same
invindex. The first series is done using a simple query call:
my $hits = $searcher->search(query => 'organic');

The second series is done using TermQuery:
my $term = KinoSearch::Index::Term->new(title => 'organic');
my $by_title = KinoSearch::Search::TermQuery->new(term => $term);
my $hits = $searcher->search(query => $by_title);

I expect both series to produce the same results, since there's only
one field indexed per document. However, the output is different for
some search terms:

Test searches:

cotton: 10 results
bags: 29 results
organic: 18 results
bamboo: 7 results
clothes: 7 results

Test term searches:

cotton: 10 results
bags: 0 results
organic: 0 results
bamboo: 7 results
clothes: 0 results

When I remove the stemmer (KinoSearch::Analysis::Stemmer->new(language
=> 'en')) from the list of analyzers for both index and search, the
results of both series are the same.

Am I doing something wrong, or is this an actual bug? (The examples
above were run with 0.15, but I've tried 0.20_04, and it seemed to
have the same problem.)

Thanks.

--
-----------------------------------------------------
Evaldas Imbrasas
http://www.imbrasas.com
Stemming and Term/TermQuery
On Aug 14, 2007, at 4:19 PM, Evaldas Imbrasas wrote:

> The first series is done using a simple query call:
> my $hits = $searcher->search(query => 'organic');
>
> The second series is done using TermQuery:
> my $term = KinoSearch::Index::Term->new(title => 'organic');
> my $by_title = KinoSearch::Search::TermQuery->new(term => $term);
> my $hits = $searcher->search(query => $by_title);
>
> I expect both series to produce the same results, since there's only
> one field indexed per document.

They will not. The one passing through the Searcher is receiving
additional processing -- crucially, it is being passed through an
Analyzer. In the first you are searching for 'organ', which is in
the index. In the second, you are searching for 'organic', which is
not.

> However, the output is different for
> some search terms:
>
> Test searches:
>
> cotton: 10 results
> bags: 29 results
> organic: 18 results
> bamboo: 7 results
> clothes: 7 results
>
> Test term searches:
>
> cotton: 10 results
> bags: 0 results
> organic: 0 results
> bamboo: 7 results
> clothes: 0 results

Try out each of these terms at <http://snowball.tartarus.org/demo.php>.
The ones where the stemmed output is identical to the input produce
identical results.
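
The same check can be sketched locally instead of through the web demo.
This is only an illustration: it assumes the Lingua::Stem::Snowball
module from CPAN is installed and that the index was built with the
English stemmer. The helper name changed_by_stemming is made up for
this example and is not part of KinoSearch.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# English Snowball stemmer; assumed to match the language used at index time.
my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Hypothetical helper: true if stemming alters the (lowercased) word.
sub changed_by_stemming {
    my ($word) = @_;
    my $stem = $stemmer->stem($word);
    return $stem ne lc $word;
}

# Print each test term next to its stem and flag the ones that change.
for my $word (qw( cotton bags organic bamboo clothes )) {
    my $stem = $stemmer->stem($word);
    printf "%-8s -> %-8s %s\n", $word, $stem,
        changed_by_stemming($word) ? '(stem differs)' : '(unchanged)';
}
```

Terms flagged as "stem differs" are the ones where an unstemmed
TermQuery would be looking for text that the stemmer never wrote to
the index.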

PS to the list regarding my continuing absence...

I mentioned in a post a little while ago that for the contract job
I've been working on, testing had begun and the project lead had
left. Testing has gone about as well as we might have expected.
However, the absence of the project lead has made our troubleshooting
significantly less efficient, and I have had to step it up to
compensate as best I can. There is a lot of money at stake for a lot
of people, and I intend to continue pouring 100% of my efforts into
this job until it is certain that we are free and clear. I
appreciate your continuing patience with this pause.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Stemming and Term/TermQuery
On 8/14/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Aug 14, 2007, at 4:19 PM, Evaldas Imbrasas wrote:
>
> > The first series is done using a simple query call:
> > my $hits = $searcher->search(query => 'organic');
> >
> > The second series is done using TermQuery:
> > my $term = KinoSearch::Index::Term->new(title => 'organic');
> > my $by_title = KinoSearch::Search::TermQuery->new(term => $term);
> > my $hits = $searcher->search(query => $by_title);
> >
> > I expect both series to produce the same results, since there's only
> > one field indexed per document.
>
> They will not. The one passing through the Searcher is receiving
> additional processing -- crucially, it is being passed through an
> Analyzer. In the first you are searching for 'organ', which is in
> the index. In the second, you are searching for 'organic', which is
> not.

I see. Is there a way to make the search term in example #2 receive
the same additional processing as in #1?

What I'm trying to do is use several fields to filter the product
search results. Filtering by category or company can be done by simply
stuffing category and company IDs for each indexed item into their
respective fields - they're just numbers, so they're not affected by
the stemming issue. I do the same with the product tags as well, but
apparently tags need to be stemmed before they're passed on to the
filter. (If this setup is not the best way to filter search
results, I'd appreciate your advice. Apart from the stemming
issue, it seemed to work just fine.)

A bonus question - for a new system, would you recommend going with
version 0.15 or 0.20_04?

Thanks for your help Marvin.

--
-----------------------------------------------------
Evaldas Imbrasas
http://www.imbrasas.com
Stemming and Term/TermQuery
On 8/14/07, Evaldas Imbrasas <evaldas@imbrasas.com> wrote:
> > > The first series is done using a simple query call:
> > > my $hits = $searcher->search(query => 'organic');
> > >
> > > The second series is done using TermQuery:
> > > my $term = KinoSearch::Index::Term->new(title => 'organic');
> > > my $by_title = KinoSearch::Search::TermQuery->new(term => $term);
> > > my $hits = $searcher->search(query => $by_title);
> > >
> > > I expect both series to produce the same results, since there's only
> > > one field indexed per document.
> >
> > They will not. The one passing through the Searcher is receiving
> > additional processing -- crucially, it is being passed through an
> > Analyzer. In the first you are searching for 'organ', which is in
> > the index. In the second, you are searching for 'organic', which is
> > not.
>
> I see. Is there a way to make the search term in example #2 receive
> the same additional processing as in #1?

Replying to myself so that the solution is saved in the archives. In
0.15, this issue can be solved using QueryParser:

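# $analyzer should be the same analyzer chain that was used at index time.
my $title_parser = KinoSearch::QueryParser::QueryParser->new(
    analyzer       => $analyzer,
    fields         => ['title'],
    default_boolop => 'AND',
);
# The parser runs 'organic' through the analyzer, so the query is stemmed.
my $by_title = $title_parser->parse('organic');
# $bool_query is an existing BooleanQuery (not shown here).
$bool_query->add_clause(query => $by_title, occur => 'MUST');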
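
If a plain TermQuery is still preferred, another option is to normalize
the term by hand before building it. The following is a rough sketch
under several assumptions: the field was indexed with lowercasing plus
the English Snowball stemmer, Lingua::Stem::Snowball is installed, and
the input is a single word. The helper name index_form is hypothetical.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# Assumed to match the stemmer language used when the index was built.
my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Hypothetical helper: approximate what the analyzer chain does to one word
# (lowercase it, then stem it). It does not tokenize multi-word input.
sub index_form {
    my ($word) = @_;
    my $stem = $stemmer->stem( lc $word );
    return $stem;
}

my $text = index_form('Organic');
print "$text\n";

# The result would then replace the raw word in the earlier snippet, e.g.:
#   my $term     = KinoSearch::Index::Term->new( title => $text );
#   my $by_title = KinoSearch::Search::TermQuery->new( term => $term );
```

The QueryParser route above is less fragile, since it reuses the real
analyzer rather than imitating it.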

--
-----------------------------------------------------
Evaldas Imbrasas
http://www.imbrasas.com