Mailing List Archive: Keyword boosting

I've got a searching problem which I know lots of other people have run
across too. We've got documents which have keywords (which we extract and
put into a 'keywords' field) and also have body text (which we put in a
'body' field.)

Let's say we search for "text retrieval". We want to find documents that
have "text retrieval" in the body OR in the keywords, but we want to weight
hits on the keywords more heavily. I can't boost the tokens in the index
base, so I have to do that through the query.

If I convert a query for phrase Q into this:
body:Q OR keywords:Q^N
does that do what I want?

How should I select the boost factor N? Are there negative consequences to
this strategy? Am I better off doing two queries and merging the results
myself?
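
For concreteness, here is a minimal sketch of the rewrite described above, written as plain string building so it does not depend on any particular Lucene release. The class name BoostedQuery, the method build, and the boost value 4.0 are hypothetical and chosen only for illustration. The field names 'body' and 'keywords' come from the message, and the output is meant to be handed to Lucene's query parser.

```java
// Hypothetical helper: rewrites a phrase Q into  body:"Q" OR keywords:"Q"^N
final class BoostedQuery {
    private BoostedQuery() {}

    // Builds the two-field query string, boosting only the keywords clause.
    // Assumption: the phrase contains no double quotes, so no escaping is done.
    static String build(String phrase, float keywordBoost) {
        if (phrase.indexOf('"') >= 0) {
            throw new IllegalArgumentException("phrase must not contain double quotes");
        }
        String quoted = "\"" + phrase + "\"";
        return "body:" + quoted + " OR keywords:" + quoted + "^" + keywordBoost;
    }

    public static void main(String[] args) {
        // The boost of 4.0 is arbitrary; a sensible value has to be found by experiment.
        System.out.println(build("text retrieval", 4.0f));
    }
}
```

Keeping the rewrite in one place makes it easy to try several values of N against the same index and compare the rankings.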

--
Brian Goetz
Quiotix Corporation
brian@quiotix.com Tel: 650-843-1300 Fax: 650-324-8032

http://www.quiotix.com

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

Brian Goetz wrote:
> Let's say we search for "text retrieval". We want to find documents that
> have "text retrieval" in the body OR in the keywords, but we want to
> weight hits on the keywords more heavily. I can't boost the tokens in
> the index base, so I have to do that through the query.

Tokens in a keyword field will naturally tend to impact a hit more than
tokens in the body since the keyword field tends to be shorter, and
Lucene normalizes for the length of the field.
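
To illustrate the size of that effect, here is a small sketch that assumes a length normalization factor of 1/sqrt(number of terms in the field), which is the form Lucene's default scoring has used. The class name LengthNormDemo and the field lengths (4 keyword terms, 400 body terms) are made up for the example, and the real factor is stored at reduced precision in the index.

```java
// Sketch of field-length normalization, assuming norm = 1 / sqrt(numTerms).
final class LengthNormDemo {
    private LengthNormDemo() {}

    // Shorter fields get a larger factor, so a match in them counts for more.
    static double lengthNorm(int numTerms) {
        if (numTerms <= 0) {
            throw new IllegalArgumentException("numTerms must be positive");
        }
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        int keywordTerms = 4;   // hypothetical keywords field length
        int bodyTerms = 400;    // hypothetical body field length
        double keywordNorm = lengthNorm(keywordTerms);
        double bodyNorm = lengthNorm(bodyTerms);
        System.out.println("keywords norm: " + keywordNorm);
        System.out.println("body norm:     " + bodyNorm);
        System.out.println("ratio:         " + (keywordNorm / bodyNorm));
    }
}
```

Under these assumptions a hit in the keywords field already weighs roughly ten times as much as the same hit in the body, before any explicit boost is applied.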

If that's not enough, in the latest CVS version you can boost each field
of a document separately.

I've been thinking through a re-design of the way Lucene does scoring,
both in order to provide an API so that folks can change the scoring,
and to provide more powerful scoring mechanisms. Stay tuned.

Doug
