Mailing List Archive: Using multiple analyzers

Hi there,

I'm running v0.15. I want to make use of multiple (2) analyzers, so
that I can benefit from stemming and stop-words for some fields (as
default behaviour), whilst benefiting from 'exact matches' in other
fields (by not stemming). I have the following code in my 'build
index' script:

#---

# The default analyzer.
my $stemmed_analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
analyzers => [
KinoSearch::Analysis::LCNormalizer->new( language => 'en' ),
KinoSearch::Analysis::Tokenizer->new( language => 'en' ),
KinoSearch::Analysis::Stopalizer->new( language => 'en' ),
KinoSearch::Analysis::Stemmer->new( language => 'en' )
]
);

my $unstemmed_analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
analyzers => [
KinoSearch::Analysis::LCNormalizer->new( language => 'en' ),
KinoSearch::Analysis::Tokenizer->new( language => 'en' ),
]
);

my $inv_indexer = KinoSearch::InvIndexer->new(
invindex => $index_dir,
analyzer => $stemmed_analyzer,
create => 1,
);

$inv_indexer->spec_field(
name => 'title',
);

$inv_indexer->spec_field(
name => 'title_unstemmed',
analyzer => $unstemmed_analyzer,
boost => 2,
);

$inv_indexer->spec_field(
name => 'content',
);

#---

In my 'search' script, I am using the following:

#---

my $stemmed_analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
analyzers => [
KinoSearch::Analysis::LCNormalizer->new( language => 'en' ),
KinoSearch::Analysis::Tokenizer->new( language => 'en' ),
KinoSearch::Analysis::Stopalizer->new( language => 'en' ),
KinoSearch::Analysis::Stemmer->new( language => 'en' )
]
);

my $query_parser = KinoSearch::QueryParser::QueryParser->new(
analyzer => $stemmed_analyzer,
fields => [ qw/ title title_unstemmed content / ],
default_boolop => 'AND',
);

#---

How do I tell it to use the unstemmed analyzer for the title_unstemmed
field? The docs emphasize the importance of using the same analyzer in
both stages (build and search) but I cannot seem to do that as
QueryParser can only take one analyzer.

Thanks,

Adam

On Aug 17, 2007, at 2:12 AM, Adam . wrote:

> I'm running v0.15. I want to make use of multiple (2) analyzers, so
> that I can benefit from stemming and stop-words for some fields (as
> default behaviour), whilst benefiting from 'exact matches' in other
> fields (by not stemming).

This is a good technique.

> How do I tell it to use the unstemmed analyzer for the title_unstemmed
> field? The docs emphasize the importance of using the same analyzer in
> both stages (build and search) but I cannot seem to do that as
> QueryParser can only take one analyzer.

In 0.15, you create two QueryParsers and OR the Query objects they
output together using a BooleanQuery. See BooleanQuery's docs for an
example using one QueryParser and one hand-rolled Query object.
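
For the setup in the original post, that approach might look roughly
like the sketch below. It is untested and makes several assumptions:
the $index_dir path and $query_string value are placeholders, the two
analyzers are rebuilt exactly as in the 'build index' script, and
BooleanQuery's add_clause() is called with occur => 'SHOULD' as in its
0.15 docs. Check the details against your installed version.

```perl
use strict;
use warnings;

use KinoSearch::Searcher;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::Analysis::LCNormalizer;
use KinoSearch::Analysis::Tokenizer;
use KinoSearch::Analysis::Stopalizer;
use KinoSearch::Analysis::Stemmer;
use KinoSearch::QueryParser::QueryParser;
use KinoSearch::Search::BooleanQuery;

# Placeholders (assumptions, not from the original post).
my $index_dir    = '/path/to/invindex';
my $query_string = 'running shoes';

# Same analyzer chains as at index time.
my $stemmed_analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
    analyzers => [
        KinoSearch::Analysis::LCNormalizer->new( language => 'en' ),
        KinoSearch::Analysis::Tokenizer->new( language => 'en' ),
        KinoSearch::Analysis::Stopalizer->new( language => 'en' ),
        KinoSearch::Analysis::Stemmer->new( language => 'en' ),
    ],
);
my $unstemmed_analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
    analyzers => [
        KinoSearch::Analysis::LCNormalizer->new( language => 'en' ),
        KinoSearch::Analysis::Tokenizer->new( language => 'en' ),
    ],
);

# One QueryParser per analyzer, each limited to the fields that were
# indexed with that analyzer.
my $stemmed_parser = KinoSearch::QueryParser::QueryParser->new(
    analyzer       => $stemmed_analyzer,
    fields         => [ qw/ title content / ],
    default_boolop => 'AND',
);
my $unstemmed_parser = KinoSearch::QueryParser::QueryParser->new(
    analyzer       => $unstemmed_analyzer,
    fields         => [ qw/ title_unstemmed / ],
    default_boolop => 'AND',
);

# OR the two parsed queries together: a document matches if either
# clause matches.
my $bool_query = KinoSearch::Search::BooleanQuery->new;
$bool_query->add_clause(
    query => $stemmed_parser->parse($query_string),
    occur => 'SHOULD',
);
$bool_query->add_clause(
    query => $unstemmed_parser->parse($query_string),
    occur => 'SHOULD',
);

# The Searcher still needs an analyzer argument, but the queries have
# already been analyzed by their own parsers.
my $searcher = KinoSearch::Searcher->new(
    invindex => $index_dir,
    analyzer => $stemmed_analyzer,
);
my $hits = $searcher->search( query => $bool_query );

$hits->seek( 0, 10 );
while ( my $hit = $hits->fetch_hit_hash ) {
    print "$hit->{title}\n";
}
```

Each parser applies default_boolop => 'AND' only within its own
fields. The two results are then combined as optional clauses, so a
hit in title_unstemmed alone is enough to match.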

In the devel branch, these extra acrobatics aren't necessary, since
QueryParser accepts a Schema and uses it to find the right Analyzer
for each field.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/