Mailing List Archive

Providing weights for individual terms in a query based on similarity to document terms
Hellooo,

Suppose a user enters ‘box of shoes’ in my search box. I have two documents
titled ‘box of clothes’ and ‘box of socks’. I’ve figured out through a
separate algorithm that ‘socks’ is more similar to ‘shoes’ than ‘clothes’ is.

I even have a numeric score for the similarity: for socks it’s 0.8 and for
clothes it’s 0.65.

How can I feed this info to Lucene to help it rank socks higher than
clothes?

I still want the usual tf-idf rules to apply. I.e. ‘box’ and ‘of’ occur in a
lot of documents, but ‘socks’ and ‘clothes’ are rarer, so they should be
given more importance.

So I don’t want to have to override the similarity class. I just want to
be able to pass in the info that ‘socks’ and ‘clothes’ are both kinda like
synonyms for shoes, but socks is more similar to shoes than clothes is. Maybe
create a boost from the similarity score, one which doesn’t artificially boost
frequent / less important terms.

If I just provided them as regular synonyms, then they would both be
considered equal in weight.

Thanks.
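
One possible way to pass such weights in, sketched under assumptions: Lucene's classic query parser syntax accepts a per-term boost written as term^boost, so the exact term and its weighted alternatives could be grouped into a single OR clause. The class WeightedSynonymQuery and the method weightedGroup below are hypothetical helpers made up for illustration, not part of Lucene; they only build the query string. Whether the resulting ranking comes out as wanted still depends on the idf of each term in the actual index.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

public class WeightedSynonymQuery {

    // Hypothetical helper: turns {term -> similarity} into a grouped
    // clause such as (shoes^1.00 OR socks^0.80), using the classic
    // query parser's term^boost notation.
    public static String weightedGroup(Map<String, Double> weightedTerms) {
        StringJoiner group = new StringJoiner(" OR ", "(", ")");
        for (Map.Entry<String, Double> e : weightedTerms.entrySet()) {
            // Locale.ROOT keeps the decimal separator a dot in every locale.
            group.add(String.format(Locale.ROOT, "%s^%.2f", e.getKey(), e.getValue()));
        }
        return group.toString();
    }

    public static void main(String[] args) {
        // Similarity scores taken from the question; insertion order is kept.
        Map<String, Double> shoes = new LinkedHashMap<>();
        shoes.put("shoes", 1.0);
        shoes.put("socks", 0.8);
        shoes.put("clothes", 0.65);

        // prints: box of (shoes^1.00 OR socks^0.80 OR clothes^0.65)
        System.out.println("box of " + weightedGroup(shoes));
    }
}
```

The string would then be handed to a query parser. A boost acts as a multiplier on that clause's score, so the idf part of the scoring still applies to each term.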
Re: Providing weights for individual terms in a query based on similarity to document terms
I think what I'm looking for is to multiply the term frequency of each term
by the similarity score.

E.g. for 'shoes', it's an exact match, so tf * 1
For 'socks', similarity = 0.8 -> tf * 0.8
For 'clothes', similarity = 0.65 -> tf * 0.65

Is there a way to achieve this w/ Lucene's API or do I need to extend the
similarity class myself?
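
To make the arithmetic concrete, here is a minimal sketch of the weighting described above, using a simplified linear tf-idf formula rather than Lucene's actual scoring. The class WeightedTfSketch, the method weightedScore, and the idf value are all assumptions chosen for illustration. In this simplified model, scaling tf by the similarity is the same as scaling the term's whole contribution, which is what a query-time boost does; with a similarity that saturates tf (such as BM25) the two are not identical.

```java
public class WeightedTfSketch {

    // Simplified per-term contribution: similarity * tf * idf.
    // This is NOT Lucene's formula, only the arithmetic from the message.
    public static double weightedScore(double tf, double idf, double similarity) {
        return similarity * tf * idf;
    }

    public static void main(String[] args) {
        // Hypothetical idf, assumed equal for both terms so that only
        // the similarity weight separates the two documents.
        double idf = 2.0;

        double socks = weightedScore(1, idf, 0.8);    // 'box of socks'
        double clothes = weightedScore(1, idf, 0.65); // 'box of clothes'

        // prints: true
        System.out.println(socks > clothes);
    }
}
```

If the two terms have different idf values in the real index, the rarer term can still come out ahead despite a lower similarity, which matches the wish to keep the usual tf-idf rules in play.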

On Fri, Jul 3, 2020 at 8:44 PM Ali Akhtar <ali@ali.actor> wrote:
