Mailing List Archive

Distributed IDF for Solr using ExactStatsCache issue
Hello,

I am using Solr in a distributed environment where I have split my collection into parts, which I have running on different nodes. When I create each part of the collection, I set numShards and replicationFactor to 1. The query speed is most important to us, and we are not worried about load on the system.

I want a Distributed IDF across all parts of the collection so I have added this line to my solrconfig.xml:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache" />
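For context, here is a rough sketch of why aggregated statistics change the scores. It assumes the BM25 similarity and uses Lucene's BM25 IDF formula, ln(1 + (N - n + 0.5) / (n + 0.5)). The document counts are made up, and the function and variable names are illustrative only, not Solr APIs.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF in the form used by Lucene's BM25Similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical per-collection statistics for the term "shark" (made-up numbers).
local_stats = {
    "collection1": {"doc_freq": 50, "doc_count": 10_000},
    "collection2": {"doc_freq": 2_000, "doc_count": 10_000},
}

# Local IDF: each collection only sees its own statistics.
local_idf = {
    name: bm25_idf(s["doc_freq"], s["doc_count"])
    for name, s in local_stats.items()
}

# Distributed IDF: statistics are summed across collections before scoring.
global_doc_freq = sum(s["doc_freq"] for s in local_stats.values())
global_doc_count = sum(s["doc_count"] for s in local_stats.values())
global_idf = bm25_idf(global_doc_freq, global_doc_count)

for name, value in local_idf.items():
    print(f"{name} local IDF: {value:.3f}")
print(f"global IDF: {global_idf:.3f}")
```

With these numbers the global IDF falls between the two local values, so a response scored with one collection's local statistics would show visibly different scores for the same documents.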

This seems to work about 90% of the time, but if I run the same request over and over again, sometimes I get scores with a local IDF for just one part of the collection. Here is a request example:
/solr/collection1,collection2/query?q=fulltext:shark&rows=500&fl=id,url,title,score&sort=score+desc

I still get documents from both collection1 and collection2, but sometimes the scores are identical to those I get when querying only collection1. I believe that in those cases only collection1's document frequency for the term is being used.
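A minimal sketch of how the inconsistency could be measured: repeat the request and count how many distinct sets of scores come back. The host and port are assumptions, as is the standard Solr JSON response layout (`response.docs`). The helper names are hypothetical.

```python
import hashlib
import json
from collections import Counter
from typing import Callable
from urllib.request import urlopen

# Assumed host and port; the path and parameters are those of the request above.
SOLR_URL = (
    "http://localhost:8983/solr/collection1,collection2/query"
    "?q=fulltext:shark&rows=500&fl=id,url,title,score&sort=score+desc"
)


def fetch_json(url: str) -> dict:
    """Fetch a URL and parse the body as JSON."""
    with urlopen(url) as resp:
        return json.load(resp)


def score_fingerprint(response: dict) -> str:
    """Hash the (id, score) pairs so identical score sets give identical fingerprints."""
    docs = response["response"]["docs"]
    pairs = sorted((str(d["id"]), round(float(d["score"]), 6)) for d in docs)
    return hashlib.sha256(json.dumps(pairs).encode("utf-8")).hexdigest()


def count_score_variants(
    url: str, runs: int = 100, fetch: Callable[[str], dict] = fetch_json
) -> Counter:
    """Run the same query `runs` times and tally each distinct score fingerprint."""
    return Counter(score_fingerprint(fetch(url)) for _ in range(runs))
```

Calling `count_score_variants(SOLR_URL)` against a live cluster should return a single fingerprint if scoring is consistent. More than one fingerprint means the same query was scored with different statistics across runs.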

Should I use a different configuration? I would like to make sure the IDF is always distributed and the same every time I run the same query. Is there any technique I could use to ensure that this happens?

Thank you,
Cameron VandenBerg
Re: Distributed IDF for Solr using ExactStatsCache issue
Hi,

You may want to ask this question on the Solr Users mailing list instead of this one, which is dedicated to the Lucene Java library: https://solr.apache.org/community.html#mailing-lists-chat

Jan

> On 16 Mar 2021, at 20:55, Cameron M VandenBerg <cmw2@cs.cmu.edu> wrote:
>
> Hello,
>
> I am using Solr in a distributed environment where I have split my collection into parts, which I have running on different nodes. When I create each part of the collection, I set numShards and replicationFactor to 1. The query speed is most important to us, and we are not worried about load on the system.
>
> I want a Distributed IDF across all parts of the collection so I have added this line to my solrconfig.xml:
> <statsCache class="org.apache.solr.search.stats.ExactStatsCache" />
>
> This seems to work about 90% of the time, but if I run the same request over and over again, sometimes I get scores with a local IDF for just one part of the collection. Here is a request example:
> /solr/collection1,collection2/query?q=fulltext:shark&rows=500&fl=id,url,title,score&sort=score+desc
>
> I still get documents from both collection1 and collection2, but sometimes the scores are identical to those I get when querying only collection1. I believe that in those cases only collection1's document frequency for the term is being used.
>
> Should I use a different configuration? I would like to make sure the IDF is always distributed and the same every time I run the same query. Is there any technique I could use to ensure that this happens?
>
> Thank you,
> Cameron VandenBerg