Mailing List Archive

TF in MoreLikeThis
Hi,

I was looking at Lucene's code for MoreLikeThis, specifically this line:
https://github.com/apache/lucene/blob/69b040fc6292ac47d7f7fc8bc3b7fd601794e54b/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L640

In ClassicSimilarity, tf() is the square root of the term frequency, but this
code uses the raw term frequency without calling ClassicSimilarity::tf(). Is
that a bug? It would give TF a disproportionately higher weight than IDF.
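
To make the concern concrete, here is a minimal standalone sketch. It does not
call Lucene; it only copies the two formulas as I understand them from
ClassicSimilarity (tf = sqrt(freq), idf = log((docCount + 1) / (docFreq + 1)) + 1).
The class name, helper names, and the sample counts are all made up for
illustration, and the assumption that MoreLikeThis scores a term as tf * idf
comes from my reading of the linked line.

```java
public class MltTfSketch {
    // Assumed to mirror ClassicSimilarity#tf: square root of the raw frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Assumed to mirror ClassicSimilarity#idf: log((docCount + 1) / (docFreq + 1)) + 1.
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    // Score using the raw term frequency, as the linked line appears to do.
    static float rawScore(int freq, long docFreq, long docCount) {
        return freq * idf(docFreq, docCount);
    }

    // Score with the frequency passed through tf() first.
    static float dampedScore(int freq, long docFreq, long docCount) {
        return tf(freq) * idf(docFreq, docCount);
    }

    public static void main(String[] args) {
        long docCount = 1000; // hypothetical index size

        // Hypothetical common term: 16 occurrences in the source doc, found in 500 docs.
        // Hypothetical rare term: 1 occurrence in the source doc, found in 2 docs.
        System.out.println("raw    common=" + rawScore(16, 500, docCount)
                + " rare=" + rawScore(1, 2, docCount));
        System.out.println("damped common=" + dampedScore(16, 500, docCount)
                + " rare=" + dampedScore(1, 2, docCount));
    }
}
```

With these made-up counts, the common term outranks the rare one when the raw
frequency is used, and the order flips once the frequency goes through tf().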

--Petko
Re: TF in MoreLikeThis
From a quick look, your suggestion of passing the term frequency to
TFIDFSimilarity#tf makes sense.

Would you like to contribute this change? You can find contributing
guidelines here:
https://github.com/apache/lucene/blob/main/CONTRIBUTING.md.

On Thu, Mar 31, 2022 at 11:46 PM Petko Minkov <pminkov@gmail.com> wrote:
>
> Hi,
>
> I was looking at Lucene's code for MoreLikeThis, specifically this line:
> https://github.com/apache/lucene/blob/69b040fc6292ac47d7f7fc8bc3b7fd601794e54b/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L640
>
> In ClassicSimilarity, tf() is the square root of the term frequency, but this
> code uses the raw term frequency without calling ClassicSimilarity::tf(). Is
> that a bug? It would give TF a disproportionately higher weight than IDF.
>
> --Petko



--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: TF in MoreLikeThis
Yeah, I'll be happy to. I'll try to get a patch out soon.

On Fri, Apr 1, 2022 at 9:31 AM Adrien Grand <jpountz@gmail.com> wrote:

> From a quick look, your suggestion of passing the term frequency to
> TFIDFSimilarity#tf makes sense.
>
> Would you like to contribute this change? You can find contributing
> guidelines here:
> https://github.com/apache/lucene/blob/main/CONTRIBUTING.md.
Re: TF in MoreLikeThis
Sorry for the delay, but better late than never :). I put up a PR here:
https://github.com/apache/lucene/pull/940.

--Petko

On Fri, Apr 1, 2022 at 10:11 AM Petko Minkov <pminkov@gmail.com> wrote:

> Yeah, I'll be happy to. I'll try to get a patch out soon.