Mailing List Archive

Overriding Similarity
Hey Luceners,

I am doing some documentation on scoring and I am interested in use
cases people have for overriding the DefaultSimilarity. If you can
share what you did and why you did it, it would be much appreciated.

For example, Daniel Naber posted his at:
http://www.gossamer-threads.com/lists/lucene/java-user/38967?search_string=Similarity;#38967

Thanks,
Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Overriding Similarity
: I am doing some documentation on scoring and I am interested in use
: cases people have for overriding the DefaultSimilarity. If you can
: share what you did and why you did it, it would be much appreciated.

I touched on this a little bit when I committed SweetSpotSimilarity...

http://www.nabble.com/Re%3A-SweetSpotSimiliarity-p4536312.html

...really any situation where you know more about your data than just that
it's "text" is a situation where it *might* make sense to override your
Similarity method.


-Hoss


Re: Overriding Similarity
I had a situation where I was only interested in whether the term was
there or not (not how many times), and I didn't want to penalize long
fields. So I wrote a Similarity subclass where I overrode the
following methods like this:

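    // Ignore field length: any non-empty field gets the same norm.
    public float lengthNorm(String fieldName, int numTerms) {
        return numTerms > 0 ? 1.0f : 0.0f;
    }

    // Only presence matters: one occurrence scores the same as many.
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }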

And then I made this subclass the default similarity. It worked well
for tf but not for lengthNorm. The reason appears to be that the
TermScorer class does not call lengthNorm, but instead uses a cache
implemented as a static array in Similarity, made available through
static methods in Similarity. Since TermScorer calls these static
methods in Similarity, changing the default similarity has no effect
in this regard. So I ended up having to customize the code of core
lucene by changing the following code in Similarity:

    static {
        for (int i = 0; i < 256; i++)
            // Originally: NORM_TABLE[i] = SmallFloat.byte315ToFloat((byte)i);
            NORM_TABLE[i] = 1.0f;
    }

This worked well, but I had hoped to avoid changing core Lucene, so
if anyone has a better solution, I would appreciate some tips.

MHH



Re: Overriding Similarity
: And then I made this subclass the default similarity. It worked well
: for tf but not for lengthNorm. The reason appears to be that the
: TermScorer class does not call lengthNorm, but instead uses a cache

Actually, the lengthNorm method is used by the IndexWriter; it compresses
the float returned by lengthNorm into a representation that uses a single
byte, and writes it to a file (one per field) which is exposed by
IndexReader.norms(field) for use in the Scorers.

: NORM_TABLE[i] = 1.0f; //Originally: NORM_TABLE[i] =
: SmallFloat.byte315ToFloat((byte)i);

that norm table is just used as a cache of mappings from the "byte
encoded" values to the nearest float value so that Scorers don't need to
call SmallFloat.byte315ToFloat((byte)i) every time.
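
To make the byte encoding and the lookup table concrete, here is a minimal
standalone sketch. It does not call Lucene: NormCodecSketch, encode, decode
and cachedDecode are hypothetical names, and the shift constants are an
assumption about how SmallFloat's "315" variant works, written from memory
rather than copied from the Lucene source. It only illustrates why a
256-entry table is enough to cache every decoded value, and why the round
trip through one byte loses precision.

```java
// Standalone sketch, not Lucene code. The bit layout below is an assumed
// re-implementation of the single-byte float format; verify it against
// SmallFloat in your Lucene version before relying on exact values.
final class NormCodecSketch {
    // Offset between the float exponent bias and the byte format's bias,
    // shifted into the byte's exponent position.
    private static final int ZERO_EXP_SHIFTED = (63 - 15) << 3;

    // One decoded float per possible byte value, filled once at class load,
    // so a scorer can do an array lookup instead of bit manipulation.
    static final float[] DECODE_TABLE = new float[256];
    static {
        for (int i = 0; i < 256; i++) {
            DECODE_TABLE[i] = decode((byte) i);
        }
    }

    // Compress a float into one byte, dropping low-order mantissa bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ZERO_EXP_SHIFTED) {
            // Zero and negatives map to 0; tiny positives to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_EXP_SHIFTED + 0x100) {
            return (byte) -1; // overflow: clamp to the largest code
        }
        return (byte) (small - ZERO_EXP_SHIFTED);
    }

    // Expand one byte back into the nearest representable float.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // What a scorer would call: a table lookup keyed by the unsigned byte.
    static float cachedDecode(byte b) {
        return DECODE_TABLE[b & 0xff];
    }

    public static void main(String[] args) {
        System.out.println(cachedDecode(encode(1.0f)));
        System.out.println(cachedDecode(encode(0.3f)));
    }
}
```

Under these assumptions 1.0f survives the round trip unchanged, while 0.3f
comes back as 0.25f. The table only decodes what was written at index time;
it never consults the current Similarity.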

If you use Similarity.setDefault (or IndexWriter.setSimilarity) before
building your index you shouldn't need that change.
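
The index-time versus search-time split can be illustrated with a toy
model. Nothing here is Lucene API: LengthNorm, ToyIndex and
NormTimingSketch are hypothetical names, and the norm is kept as a plain
float instead of an encoded byte to keep the sketch short. The two
formulas are 1/sqrt(numTerms), which is what DefaultSimilarity is
understood to use, and the flat norm from earlier in this thread.

```java
// Toy model only: hypothetical types, not the Lucene API.
interface LengthNorm {
    float lengthNorm(String fieldName, int numTerms);
}

final class ToyIndex {
    private final float[] storedNorms = new float[16];
    private int docCount = 0;

    // "Index time": the norm is computed once, by whatever similarity the
    // writer was given, and stored alongside the document.
    int addDocument(int numTerms, LengthNorm similarity) {
        storedNorms[docCount] = similarity.lengthNorm("body", numTerms);
        return docCount++;
    }

    // "Search time": the scorer only reads the stored value back. The
    // similarity that is active now is never asked for a length norm.
    float normFor(int docId) {
        return storedNorms[docId];
    }
}

final class NormTimingSketch {
    // Shorter fields score higher.
    static final LengthNorm DEFAULT_LIKE = new LengthNorm() {
        public float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));
        }
    };

    // Field length is ignored, as in the override posted above.
    static final LengthNorm FLAT = new LengthNorm() {
        public float lengthNorm(String fieldName, int numTerms) {
            return numTerms > 0 ? 1.0f : 0.0f;
        }
    };

    public static void main(String[] args) {
        ToyIndex builtWithDefault = new ToyIndex();
        int a = builtWithDefault.addDocument(100, DEFAULT_LIKE);

        ToyIndex builtWithFlat = new ToyIndex();
        int b = builtWithFlat.addDocument(100, FLAT);

        // The stored norm reflects the similarity used while indexing.
        System.out.println(builtWithDefault.normFor(a));
        System.out.println(builtWithFlat.normFor(b));
    }
}
```

In this model, switching similarities after the index is built cannot
change normFor; the index has to be rebuilt with the new similarity for a
lengthNorm override to show up in scores.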




-Hoss


Re: Overriding Similarity
Ah, I see, I should of course use the same similarity during indexing
and searching. Many thanks!

On 20/08/06, Chris Hostetter <hossman_lucene@fucit.org> wrote:
> If you use Similarity.setDefault (or IndexWriter.setSimilarity) before
> building your index you shouldn't need that change.

Re: [ANN] Overriding Similarity
Hello

The original message explaining why I wanted to override DefaultSimilarity
is here:

http://www.nabble.com/forum/ViewPost.jtp?post=9268781&framed=y

So I overrode DefaultSimilarity like this:

package org.apache.lenya.lucene.index;

import org.apache.lucene.search.DefaultSimilarity;

public class PersonalSimilarity extends DefaultSimilarity {
    // Return a constant so that a term's rarity no longer affects the score.
    public float idf(int docFreq, int numDocs) {
        return 1;
    }
}

And I use it like this:

PersonalSimilarity ds = new PersonalSimilarity();
searcher.setSimilarity(ds);

After that, all the hits have the same score.

But my new question is: do I need to use this PersonalSimilarity for the
IndexWriter as well? Currently it works fine when set only on the Searcher.

--
View this message in context: http://www.nabble.com/Overriding-Similarity-tf2128934.html#a9269838
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


Re: [ANN] Overriding Similarity
If you modify all scores to 100%, why can't you just ignore them?

Anyhow, if all you want to modify is idf(), using the modified similarity
class only at search time is ok. For further pointers and explanations take
a look at http://lucene.apache.org/java/docs/scoring.html
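
As a rough illustration of why idf() is a search-time-only concern, the
toy scorer below takes idf as a pluggable function and applies it when the
query runs, so nothing stored in the index depends on it. All names (Idf,
IdfSketch, score) are hypothetical, and the simplified score of
tf * idf^2 * storedNorm leaves out query normalization, coord and boosts.
The default-style formulas, sqrt(freq) for tf and
log(numDocs/(docFreq+1)) + 1 for idf, are what DefaultSimilarity is
believed to use; check them against your Lucene version.

```java
// Toy scorer, not the Lucene API: every name here is hypothetical.
interface Idf {
    float idf(int docFreq, int numDocs);
}

final class IdfSketch {
    // Rare terms get a larger weight than common ones.
    static final Idf DEFAULT_LIKE = new Idf() {
        public float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
    };

    // Same idea as PersonalSimilarity: every term weighs the same.
    static final Idf CONSTANT = new Idf() {
        public float idf(int docFreq, int numDocs) {
            return 1.0f;
        }
    };

    // storedNorm stands for the value written at index time; idf is
    // evaluated here, at query time, from current index statistics.
    static float score(float freq, int docFreq, int numDocs,
                       float storedNorm, Idf idf) {
        float tf = (float) Math.sqrt(freq);
        float weight = idf.idf(docFreq, numDocs);
        return tf * weight * weight * storedNorm;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        float norm = 0.5f;
        // A rare term (2 docs) versus a common one (500 docs).
        System.out.println(score(1, 2, numDocs, norm, DEFAULT_LIKE));
        System.out.println(score(1, 500, numDocs, norm, DEFAULT_LIKE));
        System.out.println(score(1, 2, numDocs, norm, CONSTANT));
        System.out.println(score(1, 500, numDocs, norm, CONSTANT));
    }
}
```

With the constant idf both terms score the same against the same stored
norm, which matches the behaviour reported above without touching the
index.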

Regards,
Doron

jreeman <mimounl@hotmail.com> wrote on 02/03/2007 05:49:23:

> But my new question is: do I need to use this PersonalSimilarity for the
> IndexWriter as well? Currently it works fine when set only on the Searcher.


Re: [ANN] Overriding Similarity
> If you modify all scores to 100%, why can't you just ignore them?
Because only in this case will the scores all be 100%; in other cases they
won't. The fact is that I don't want the score to depend on the frequency.
I don't want this "smart" feature.

> Anyhow, if all you want to modify is idf(), using the modified similarity
> class only at search time is ok. For further pointers and explanations take
> a look at http://lucene.apache.org/java/docs/scoring.html

thanks a lot.

