Mailing List Archive

a "fair" similarity
Hi,

as some of you may have noticed, Lucene prefers shorter documents over
longer ones, i.e. shorter documents get a higher ranking, even if the
ratio "matched terms / total terms in document" is the same.

For example, take these two artificial documents:

doc1: x 2 3 4 5 6 7 8 9 10
doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

When searching for "x" doc1 will get a higher ranking, even though "x"
makes up 1/10 of the terms in both documents.

Using this similarity implementation seems to "fix" that:

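import org.apache.lucene.search.DefaultSimilarity;

class MySim extends DefaultSimilarity {

    // Default is 1 / sqrt(numTerms); here the norm falls off linearly.
    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / numTerms);
    }

    // Default is sqrt(freq); here the raw frequency is used.
    public float tf(float freq) {
        return freq;
    }

}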

It's basically just the default implementation with Math.sqrt() removed. Is
this the correct approach? Are there any problems to expect? I just tested
it with the documents cited above.
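
To make the two variants concrete, here is a minimal sketch that evaluates only the tf * lengthNorm part of the score for doc1 and doc2. It is plain Java with no Lucene dependency; the class and method names are made up for illustration. It leaves out everything else that goes into a real Lucene score (idf, queryNorm, coord, boosts, and the way norms are stored in the index), so the numbers are not actual Lucene scores.

```java
public class LengthNormSketch {

    // Default formulas: tf = sqrt(freq), lengthNorm = 1 / sqrt(numTerms).
    static double defaultFactor(int freq, int numTerms) {
        return Math.sqrt(freq) * (1.0 / Math.sqrt(numTerms));
    }

    // Proposed formulas: tf = freq, lengthNorm = 1 / numTerms.
    static double fairFactor(int freq, int numTerms) {
        return freq * (1.0 / numTerms);
    }

    public static void main(String[] args) {
        // doc1: "x" occurs once in 10 terms; doc2: twice in 20 terms.
        System.out.println("doc1 default: " + defaultFactor(1, 10));
        System.out.println("doc2 default: " + defaultFactor(2, 20));
        System.out.println("doc1 fair:    " + fairFactor(1, 10));
        System.out.println("doc2 fair:    " + fairFactor(2, 20));
    }
}
```

In exact arithmetic both products depend only on the ratio freq / numTerms: the default reduces to sqrt(freq / numTerms), the proposed version to freq / numTerms. Any ranking difference between doc1 and doc2 seen in practice would therefore have to come from the parts this sketch omits.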

The use case is that I want to boost fields, e.g. "body:foo^2 title:blah".
This could lead to strange results if title is already preferred just
because it's shorter.

Regards
Daniel

--
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: a "fair" similarity
Daniel Naber wrote:
> Hi,
>
> as some of you may have noticed, Lucene prefers shorter documents over
> longer ones, i.e. shorter documents get a higher ranking, even if the
> ratio "matched terms / total terms in document" is the same.
>
> For example, take these two artificial documents:
>
> doc1: x 2 3 4 5 6 7 8 9 10
> doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>
> When searching for "x" doc1 will get a higher ranking, even though "x"
> makes up 1/10 of the terms in both documents.

I think it depends upon what you want "similar" to mean. The shorter doc
thing comes from the "parsimony" concept, if I remember my Information Theory
correctly. In other words, the less data to get to a given result (1/10 "x"
in your example) the better. It sounds like you want doc1 and doc2 to be
considered exactly similar, at least for "x". Would you want doc3 below to be
treated the same way?

doc3: x 2 3 4 5 6 7 8 9 10
x 12 13 14 15 16 17 18 19 20
x 22 ... 30
x 32 ... 40
... 1000

In some situations, the appearance of "x" is more significant in doc1, because
hardly anything is there in the first place. I think that tends to be more
common in English prose, which may be why it's the default in Lucene.

I think your proposed formula would treat all docs, 1-3, the same. If that's
what you want, I'd say you're good to go.
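
As a rough check of that claim, the sketch below builds doc1 and a doc3-like document and computes freq / numTerms for "x", which is what tf * lengthNorm reduces to once the square roots are removed. It assumes the elided part of doc3 continues the visible pattern ("x" at positions 1, 11, 21, ... up to 1000 terms); the class and method names are hypothetical, and no Lucene code is involved.

```java
import java.util.ArrayList;
import java.util.List;

public class Doc3Sketch {

    // Builds numTerms tokens where positions 1, 11, 21, ... hold "x"
    // and every other position holds its own index as a string.
    static List<String> buildDoc(int numTerms) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 1; i <= numTerms; i++) {
            tokens.add(i % 10 == 1 ? "x" : Integer.toString(i));
        }
        return tokens;
    }

    // freq / numTerms: tf(freq) * lengthNorm(numTerms) without the sqrt.
    static double fairFactor(List<String> tokens, String term) {
        int freq = 0;
        for (String t : tokens) {
            if (t.equals(term)) {
                freq++;
            }
        }
        return (double) freq / tokens.size();
    }

    public static void main(String[] args) {
        System.out.println("doc1: " + fairFactor(buildDoc(10), "x"));
        System.out.println("doc3: " + fairFactor(buildDoc(1000), "x"));
    }
}
```

Under that assumption both documents get the same factor, since one term in ten is "x" in each.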

--MDC

Re: a "fair" similarity
Michael D. Curtin wrote:
> Daniel Naber wrote:
>
>> Hi,
>>
>> as some of you may have noticed, Lucene prefers shorter documents over
>> longer ones, i.e. shorter documents get a higher ranking, even if the
>> ratio "matched terms / total terms in document" is the same.

There are even more interesting kinds of "unfairness".

Suppose we have a document. We can turn the
document into a query in the obvious way (a set
of boolean SHOULD clauses with term frequencies
given by counts in the doc).

Lucene's IDF scaling is only applied to the query.
This is great for performance, because the doc vectors
remain stable as new docs are added.

Then, in general:

score(doc,doc) < score(doc,doc')

if IDF(doc) = doc'. That is, the inversely IDF-scaled
query matches a document better than the document itself.
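
One way to read this, as a sketch only: treat documents and queries as term-weight vectors, score with plain cosine similarity, and apply the IDF weights on the query side only. The term counts and IDF values below are made up, the class and method names are hypothetical, and real Lucene scoring has further factors that this ignores.

```java
public class IdfSelfMatchSketch {

    // Plain cosine similarity between two equal-length vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Element-wise product: scales each term count by that term's IDF weight.
    static double[] scale(double[] counts, double[] idf) {
        double[] out = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            out[i] = counts[i] * idf[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] doc = {3, 1, 2};          // made-up term counts
        double[] idf = {1.0, 2.5, 4.0};    // made-up IDF weights
        double[] query = scale(doc, idf);    // doc as a query, IDF applied
        double[] docPrime = scale(doc, idf); // doc' = IDF(doc), stored as-is
        System.out.println("score(doc, doc)  ~ " + cosine(query, doc));
        System.out.println("score(doc, doc') ~ " + cosine(query, docPrime));
    }
}
```

With these assumptions the IDF-scaled query lines up exactly with doc' and only approximately with doc, so the first score comes out lower whenever the IDF weights are not all equal.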

- Bob Carpenter
Alias-i

Re: a "fair" similarity
Hi,

I've tried this "fair" similarity with Lucene 2.2, but it does not seem to
work.

I've attached the custom "MyFair" similarity to both IndexWriter and
IndexSearcher.

Do you have any idea why?

Thanks a lot,

Fabrice




