Hi Team,
I was discussing this problem with Greg Miller (also at Amazon Product
Search):
If I want to make a query that filters out a few primary keys (ASIN in our
Amazon Product Search world), I can make a TermInSetQuery and add it as a
MUST_NOT onto a BooleanQuery that has all the other interesting clauses for
my query.
But if I have many, many ASINs to filter out, at some point it may become
more efficient to just use doc values and filter them out during
collection, like Solr's "post-filter": load the BINARY value or SORTED
(globalized) ordinal for each hit and check e.g. a HashSet to see if the
hit should be skipped, not using the inverted index at all.
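To make the idea concrete, here is a minimal sketch in plain Java with no
Lucene APIs. The String[] stands in for a per-document doc values column, and
all names (DocValuesPostFilterSketch, collect, asinByDocId) are made up for
illustration only. With SORTED doc values one would presumably resolve the
excluded terms to ordinals once and check ints instead of strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: a plain array plays the role of the
// doc values column, indexed by doc ID.
class DocValuesPostFilterSketch {

    // Keeps the candidate doc IDs whose key is NOT in the excluded set.
    // The check happens per hit, at "collection" time, without consulting
    // any inverted index.
    public static List<Integer> collect(int[] candidateDocIds,
                                        String[] keyByDocId,
                                        Set<String> excluded) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : candidateDocIds) {
            String key = keyByDocId[docId]; // "load the doc value" for this hit
            if (!excluded.contains(key)) {  // O(1) expected HashSet lookup
                kept.add(docId);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed sample data: doc ID -> ASIN.
        String[] asinByDocId = {"B001", "B002", "B003", "B004", "B005"};
        // Doc IDs that matched the other clauses of the query.
        int[] candidates = {0, 2, 3, 4};
        Set<String> excluded = new HashSet<>(List.of("B003", "B005"));

        System.out.println(collect(candidates, asinByDocId, excluded)); // prints [0, 3]
    }
}
```

The cost here scales with the number of hits from the other clauses rather
than with the number of excluded ASINs, which is why it might win when the
exclusion set is very large.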
Do we already have such a "slow DV TermInSet" query?
It seems like it could belong in SortedDocValuesField, where we already have
newSlowRangeQuery and newSlowExactQuery; we could add a newSlowInSetQuery?
And then we could make an IndexOrDocValuesQuery with both the
TermInSetQuery and this SDV.newSlowInSetQuery?
Or maybe there is already a good way to do this in Lucene?
Thanks!
Mike McCandless
http://blog.mikemccandless.com