Hi,
as some of you may have noticed, Lucene prefers shorter documents over
longer ones, i.e. shorter documents get a higher ranking, even if the
ratio "matched terms / total terms in document" is the same.
For example, take these two artificial documents:
doc1: x 2 3 4 5 6 7 8 9 10
doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
When searching for "x" doc1 will get a higher ranking, even though "x"
makes up 1/10 of the terms in both documents.
Using this similarity implementation seems to "fix" that:
class MySim extends DefaultSimilarity {
public float lengthNorm(String fieldName, int numTerms) {
return (float)(1.0 / numTerms);
}
public float tf(float freq) {
return (float)freq;
}
}
It's basically just the default implementation with Math.sqrt() removed. Is
this the correct approach? Are there any problems to expect? I just tested
it with the documents cited above.
The use case is that I want to boost fields, e.g. "body:foo^2 title:blah".
This could lead to strange results if title is already preferred just
because it's shorter.
Regards
Daniel
--
http://www.danielnaber.de
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
as some of you may have noticed, Lucene prefers shorter documents over
longer ones, i.e. shorter documents get a higher ranking, even if the
ratio "matched terms / total terms in document" is the same.
For example, take these two artificial documents:
doc1: x 2 3 4 5 6 7 8 9 10
doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
When searching for "x" doc1 will get a higher ranking, even though "x"
makes up 1/10 of the terms in both documents.
Using this similarity implementation seems to "fix" that:
class MySim extends DefaultSimilarity {
public float lengthNorm(String fieldName, int numTerms) {
return (float)(1.0 / numTerms);
}
public float tf(float freq) {
return (float)freq;
}
}
It's basically just the default implementation with Math.sqrt() removed. Is
this the correct approach? Are there any problems to expect? I just tested
it with the documents cited above.
The use case is that I want to boost fields, e.g. "body:foo^2 title:blah".
This could lead to strange results if title is already preferred just
because it's shorter.
Regards
Daniel
--
http://www.danielnaber.de
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org