Mailing List Archive

How can I boost score of a document if two consecutive terms match
Hi there,
I have recently been developing my own search application based on Lucene.
Here is the use case I am concerned about.

We have two documents in the index:
a) content:new jersey
b) content:new year

The query is "he is celebrating the new year in jersey city".


If I tokenize the query and add all of its terms to a BooleanQuery, the
two documents get the same score for that query. What I want is for b to
score higher than a. What Similarity should I use, or how can I tweak
Lucene's internals to achieve this?

Please note that I cannot extract the phrase "new year" at compile time, so
it seems to me that PhraseQuery is not an option.

Thank you very much for the help!
John
Re: How can I boost score of a document if two consecutive terms match
Hi John,

What you're looking for sounds like Solr's pf2 parameter (see
https://lucene.apache.org/solr/guide/8_6/the-extended-dismax-query-parser.html#extended-dismax-parameters
and
https://lucene.apache.org/solr/guide/8_6/the-dismax-query-parser.html#pf-phrase-fields-parameter
for details).

Basically, behind the scenes, it takes successive pairs of terms, and
treats them as boosted phrase query clauses. So, a query like "t1 t2 t3"
with a pf2 boost of 5 would become roughly:

t1 OR t2 OR t3 OR "t1 t2"^5 OR "t2 t3"^5
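
To make the expansion concrete, here is a minimal sketch in plain Java that
turns a list of already-analyzed tokens into the query string above. It only
illustrates the shape of the rewritten query; it is not how Solr implements
pf2, and it does not escape query-syntax characters in the tokens. The class
name PairPhraseExpansion and the method expand are hypothetical names chosen
for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: builds the "terms plus boosted adjacent-pair phrases"
// query string from tokens that have already been analyzed.
final class PairPhraseExpansion {

    static String expand(List<String> tokens, int boost) {
        // Start with every single term as its own optional clause.
        List<String> clauses = new ArrayList<>(tokens);
        // Add one boosted two-word phrase for each pair of adjacent tokens.
        for (int i = 0; i + 1 < tokens.size(); i++) {
            clauses.add("\"" + tokens.get(i) + " " + tokens.get(i + 1) + "\"^" + boost);
        }
        return String.join(" OR ", clauses);
    }
}
```

Calling expand with the tokens t1, t2, t3 and a boost of 5 produces the query
string shown above.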

Alternatively, since it sounds like you want to boost matches where two
consecutive words are both present in the same document, rather than
requiring that they're present in order, you could parse the query to:

t1 OR t2 OR t3 OR (t1 AND t2)^5 OR (t2 AND t3)^5

Are you using a QueryParser implementation or are you just running the
query string through an Analyzer and producing your own BooleanQuery? If
the latter, you could directly produce the second query (wrapping the
nested AND queries in a BoostQuery).
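
If you are building the BooleanQuery yourself, a minimal sketch of that
second form could look like the following. It assumes lucene-core is on the
classpath and that the tokens have already been produced by your Analyzer.
The class name AdjacentPairBoost, the method build, and its parameters are
hypothetical names chosen for illustration.

```java
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: t1 OR t2 OR t3 OR (t1 AND t2)^boost OR (t2 AND t3)^boost
final class AdjacentPairBoost {

    static BooleanQuery build(String field, List<String> tokens, float boost) {
        BooleanQuery.Builder top = new BooleanQuery.Builder();

        // Every individual term is an optional (SHOULD) clause.
        for (String token : tokens) {
            top.add(new TermQuery(new Term(field, token)), BooleanClause.Occur.SHOULD);
        }

        // For each pair of adjacent query terms, add an optional boosted clause
        // that matches only documents containing both terms (in any order).
        for (int i = 0; i + 1 < tokens.size(); i++) {
            Query pair = new BooleanQuery.Builder()
                .add(new TermQuery(new Term(field, tokens.get(i))), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term(field, tokens.get(i + 1))), BooleanClause.Occur.MUST)
                .build();
            top.add(new BoostQuery(pair, boost), BooleanClause.Occur.SHOULD);
        }
        return top.build();
    }
}
```

For John's example, build("content", tokens, 5f) on the analyzed query tokens
would include a boosted (new AND year) clause. Only document b contains both
terms, so it should score higher than a. To require the words in order
instead, the inner BooleanQuery could be swapped for a PhraseQuery on the
same two terms.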

Would that do what you want?

Michael

On Fri, Oct 30, 2020 at 2:15 PM YAN PAN <pan1996y@gmail.com> wrote:

> Hi there,
> I recently am developing my own search based on lucene, here is the use
> case I am concerned about.
>
> we have two documents in the index
> a) content:new jersey
> b) content:new year
>
> the query is "he is celebrating the new year in jersey city".
>
>
> If I tokenize the queries and add all terms to a boolean query, the
> document will have the same score for the two queries, but what I want is
> that b scores higher than a, what similarity should I use, or how can I
> tweak the internal of Lucene to achieve the goal?
>
> Please note that I cannot extract the phrase "new year" at compile time, so
> it seems to me that PhraseQuery is not an approach.
>
> Thank you very much for the help!
> John
>