Mailing List Archive

Tuning MoreLikeThis scoring algorithm
I'd like to have suggestions on changing the scoring algorithm
of MoreLikeThis.

When I pass a string identical to the content of an indexed document
to MoreLikeThis.like("field", new StringReader(docContent)),
I get a score below the 1.0 I expect (0.944 in one of my test cases).

What is the easiest way to change this so that the score is 1.0 when
all the terms in the query match all the terms of a document?
The score should be less than 1.0 if the query contains only a part of the terms
from the document. (Needless to say, the score should also be less than 1.0
if only part of the query terms are found in the document.)

For my purpose, I don't need a sophisticated search relevancy technique
like TF-IDF. I'd like it to work faster/cheaper.
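
One way to picture the requirement above, outside of Lucene's scoring, is a plain set-overlap (Jaccard) ratio. The sketch below is only an illustration under stated assumptions: the class name TermOverlap, the helper terms(), and the naive lowercase/split-on-non-word tokenizer are hypothetical and are not part of MoreLikeThis or any Lucene API.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Lucene class: scores two texts by term-set overlap.
public class TermOverlap {

    // Naive tokenizer (an assumption): lowercase, split on non-word characters.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Jaccard ratio: |intersection| / |union| of the two term sets.
    // It is 1.0 only when both sides contain exactly the same distinct terms.
    static double jaccard(String query, String doc) {
        Set<String> q = terms(query);
        Set<String> d = terms(doc);
        Set<String> union = new HashSet<>(q);
        union.addAll(d);
        if (union.isEmpty()) {
            return 0.0; // no terms on either side; treated as no match here
        }
        Set<String> intersection = new HashSet<>(q);
        intersection.retainAll(d);
        return (double) intersection.size() / union.size();
    }
}
```

With this definition, "quick brown" against "quick brown fox" gives 2/3, and the score drops below 1.0 whichever side has the extra terms. It ignores term counts entirely, which keeps it cheap but also means repeated terms make no difference.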

I tried using BooleanSimilarity, but that ended up returning a score above 1.0.
Also the score is the same as long as all the terms in the query are matched.
For example, querying "quick brown fox" and "quick brown" yields the same score
against the doc that has the famous test string.


TK


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Tuning MoreLikeThis scoring algorithm
See https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages
which has some broken nabble links, but is still valid.

TLDR: Scoring just doesn't work the way you think. Don't try to
interpret it as an absolute value; it is a relative one.

On Fri, May 28, 2021 at 1:36 PM TK Solr <tksolrml@sonic.net> wrote:
>
> I'd like to have suggestions on changing the scoring algorithm
> of MoreLikeThis.
>
> When I pass a string identical to the content of an indexed document
> to MoreLikeThis.like("field", new StringReader(docContent)),
> I get a score below the 1.0 I expect (0.944 in one of my test cases).
>
> What is the easiest way to change this so that the score is 1.0 when
> all the terms in the query match all the terms of a document?
> The score should be less than 1.0 if the query contains only a part of the terms
> from the document. (Needless to say, the score should also be less than 1.0
> if only part of the query terms are found in the document.)
>
> For my purpose, I don't need a sophisticated search relevancy technique
> like TF-IDF. I'd like it to work faster/cheaper.
>
> I tried using BooleanSimilarity, but that ended up returning a score above 1.0.
> Also the score is the same as long as all the terms in the query are matched.
> For example, querying "quick brown fox" and "quick brown" yields the same score
> against the doc that has the famous test string.
>
>
> TK
>

Re: Tuning MoreLikeThis scoring algorithm
Thank you for the information, Robert.

The argument against normalized scores makes sense for the regular
kind of search, where queries are much shorter than the documents.

But MLT is a document vs document search. Can't we define a 100% match as
every term occurring in both documents with the same count?
That would basically be a cosine similarity between the two documents, I think.
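
To make the cosine idea concrete, here is a minimal sketch that compares two texts by their raw term-count vectors. It is plain Java and does not use Lucene; the class name CosineSketch, the helper termCounts(), and the naive tokenizer are assumptions made for illustration, not how MoreLikeThis computes its score.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not a Lucene class: cosine similarity over term counts.
public class CosineSketch {

    // Naive tokenizer (an assumption): lowercase, split on non-word characters,
    // then count how often each term occurs.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                counts.merge(t, 1, Integer::sum);
            }
        }
        return counts;
    }

    // cosine(a, b) = dot(a, b) / (|a| * |b|) over the term-count vectors.
    static double cosine(String a, String b) {
        Map<String, Integer> ca = termCounts(a);
        Map<String, Integer> cb = termCounts(b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = cb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int c : cb.values()) {
            normB += (double) c * c;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty text has no direction to compare
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

Identical texts score 1.0, and "quick brown" against "quick brown fox" scores 2/sqrt(6), about 0.816. One caveat: cosine is scale-invariant, so texts whose term counts are merely proportional (for example "a b" and "a a b b") also score 1.0, which is slightly looser than requiring the same count for every term.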

TK

On 5/28/21 6:27 PM, Robert Muir wrote:
> See https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages
> which has some broken nabble links, but is still valid.
>
> TLDR: Scoring just doesn't work the way you think. Don't try to
> interpret it as an absolute value, it is a relative one.
