
best way (performance wise) to search for field without value?
Hi all,

We have implemented some security on our index by adding a field
'groups_allowed' to documents and wrapping a boolean MUST query around the
original query that checks whether one of the given user groups matches at
least one value in groups_allowed.

We chose to leave the groups_allowed field empty when the document should
be retrievable by all users, so we also need to select a document if
its 'groups_allowed' field is empty.

What would be the fastest query construction to do so?


Currently I use a TermRangeQuery that basically matches all values and put
that in a MUST_NOT clause combined with a MatchAllDocsQuery, but that gets
rather slow when the number of groups is high.
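
For readers following along, the construction described above might look roughly like the sketch below. It assumes Lucene 8.x on the classpath; the class and method names (GroupFilterBaseline, noGroups, secured) are made up for illustration and are not from the original code.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermRangeQuery;

// Hypothetical helper illustrating the baseline (slow) approach from the post.
public final class GroupFilterBaseline {
    static final String GROUPS = "groups_allowed";

    // Matches documents that have no term at all in groups_allowed.
    static Query noGroups() {
        // Open-ended range: null bounds match every term in the field,
        // so the query has to visit all distinct group terms.
        Query anyGroup = new TermRangeQuery(GROUPS, null, null, true, true);
        return new BooleanQuery.Builder()
            .add(new MatchAllDocsQuery(), Occur.MUST)
            .add(anyGroup, Occur.MUST_NOT)
            .build();
    }

    // Wraps the original query so that only visible documents match.
    static Query secured(Query original, String... userGroups) {
        BooleanQuery.Builder access = new BooleanQuery.Builder();
        for (String group : userGroups) {
            access.add(new TermQuery(new Term(GROUPS, group)), Occur.SHOULD);
        }
        access.add(noGroups(), Occur.SHOULD);
        return new BooleanQuery.Builder()
            .add(original, Occur.MUST)
            .add(access.build(), Occur.MUST)
            .build();
    }
}
```

The cost comes from the range clause: it expands over every distinct value in groups_allowed, which is presumably why it degrades as the number of groups grows.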

Thanks!
Re: best way (performance wise) to search for field without value?
Maybe NormsFieldExistsQuery as a MUST_NOT clause? Though, you must enable
norms on your field to use that.

TermRangeQuery is indeed a horribly costly way to execute this, but if you
cache the result on each refresh, perhaps it is OK?

You could also index a dedicated doc values field indicating that the field
is empty and then use DocValuesFieldExistsQuery.
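
A minimal sketch of the NormsFieldExistsQuery option, assuming Lucene 8.x. StringField omits norms by default, so the sketch copies its field type and switches norms back on; the class and member names (NormsExistsSketch, GROUP_TYPE, groupField, noGroups) are hypothetical.

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.NormsFieldExistsQuery;
import org.apache.lucene.search.Query;

// Hypothetical helper: "field has no value" via norms.
public final class NormsExistsSketch {
    static final String GROUPS = "groups_allowed";

    // Same as StringField (indexed, not tokenized, not stored) but with norms enabled.
    static final FieldType GROUP_TYPE = new FieldType(StringField.TYPE_NOT_STORED);
    static {
        GROUP_TYPE.setOmitNorms(false);
        GROUP_TYPE.freeze();
    }

    // Use this instead of new StringField(...) when indexing a group value.
    static Field groupField(String group) {
        return new Field(GROUPS, group, GROUP_TYPE);
    }

    // Matches documents for which groups_allowed has no norms, i.e. no indexed value.
    static Query noGroups() {
        return new BooleanQuery.Builder()
            .add(new MatchAllDocsQuery(), Occur.MUST)
            .add(new NormsFieldExistsQuery(GROUPS), Occur.MUST_NOT)
            .build();
    }
}
```

Note that the norms setting applies per field name for the whole index, so an existing index that already has groups_allowed without norms would need to be reindexed for this to work.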

Mike McCandless

http://blog.mikemccandless.com


Re: best way (performance wise) to search for field without value?
That's great Rob! Thanks for bringing closure.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Nov 13, 2020 at 9:13 AM Rob Audenaerde <rob.audenaerde@gmail.com>
wrote:

> To follow up, based on a quick JMH-test with 2M docs with some random data
> I see a speedup of 70% :)
> That is a nice friday-afternoon gift, thanks!
>
> For ppl that are interested:
>
> I added a BinaryDocValues field like this:
>
> doc.add(new BinaryDocValuesField("GROUPS_ALLOWED_EMPTY", new BytesRef(new byte[] {0x01})));
>
> And used:
>
> finalQuery.add(new DocValuesFieldExistsQuery("GROUPS_ALLOWED_EMPTY"), BooleanClause.Occur.SHOULD);
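
Put together, the marker-field approach from the follow-up could be sketched end to end as below. This assumes Lucene 8.x (DocValuesFieldExistsQuery was later superseded in 9.x); the class EmptyMarkerSketch, its helper methods, and the three sample documents are invented for illustration and are not the benchmark code mentioned above.

```java
import java.io.IOException;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DocValuesFieldExistsQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

// Hypothetical end-to-end sketch of the marker doc-values field.
public final class EmptyMarkerSketch {
    static final String GROUPS = "groups_allowed";
    static final String EMPTY_MARKER = "GROUPS_ALLOWED_EMPTY";

    // Builds a document; the marker is added only when no groups are given.
    static Document doc(String... groups) {
        Document d = new Document();
        for (String group : groups) {
            d.add(new StringField(GROUPS, group, Store.NO));
        }
        if (groups.length == 0) {
            d.add(new BinaryDocValuesField(EMPTY_MARKER, new BytesRef(new byte[] {1})));
        }
        return d;
    }

    // Visible if any user group matches OR the document carries the marker.
    static Query accessFilter(String... userGroups) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (String group : userGroups) {
            b.add(new TermQuery(new Term(GROUPS, group)), Occur.SHOULD);
        }
        b.add(new DocValuesFieldExistsQuery(EMPTY_MARKER), Occur.SHOULD);
        return b.build();
    }

    // Indexes three sample documents in memory and counts the visible ones.
    static int countVisible(String... userGroups) throws IOException {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            writer.addDocument(doc("admins"));
            writer.addDocument(doc("sales", "support"));
            writer.addDocument(doc()); // unrestricted document
            writer.commit();
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return new IndexSearcher(reader).count(accessFilter(userGroups));
            }
        }
    }
}
```

The design trade-off is that the "is empty" decision moves to index time: every code path that writes documents has to add the marker consistently, and in exchange the query only checks whether a doc value is present instead of enumerating terms.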
Re: best way (performance wise) to search for field without value?
Hi,

Solr and Elasticsearch implement the exists query like this, which is fully in line with your investigation: if a field has doc values it uses DocValuesFieldExistsQuery, if it is a tokenized field it uses the NormsFieldExistsQuery. The negative one is a must-not clause, which is perfectly fine performance-wise.

An alternative way to search is indexing all field names that have a value into a separate StringField. But this needs preprocessing.
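
As a rough illustration of that alternative, assuming Lucene 8.x: the preprocessing step collects the names of the fields present in a document and indexes them into one extra StringField, so "exists" becomes a plain TermQuery. The field name "_field_names" and the class FieldNamesSketch are made up for this sketch.

```java
import java.util.Set;
import java.util.TreeSet;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper: track which fields a document has in a dedicated field.
public final class FieldNamesSketch {
    static final String FIELD_NAMES = "_field_names"; // assumed name, not a Lucene built-in

    // Preprocessing: call once per document, after all real fields were added.
    static Document withFieldNames(Document doc) {
        Set<String> names = new TreeSet<>();
        for (IndexableField field : doc.getFields()) {
            names.add(field.name());
        }
        // Add after collecting, so the field list is not modified while iterating.
        for (String name : names) {
            doc.add(new StringField(FIELD_NAMES, name, Store.NO));
        }
        return doc;
    }

    // Documents that have a value for the given field.
    static Query exists(String field) {
        return new TermQuery(new Term(FIELD_NAMES, field));
    }

    // Documents that have no value for the given field.
    static Query missing(String field) {
        return new BooleanQuery.Builder()
            .add(new MatchAllDocsQuery(), Occur.MUST)
            .add(exists(field), Occur.MUST_NOT)
            .build();
    }
}
```

With this in place, the "no groups" clause from the original question would be missing("groups_allowed"), at the cost of one extra indexed term per field per document.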

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html

https://issues.apache.org/jira/browse/SOLR-11437

Uwe


--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
Re: best way (performance wise) to search for field without value?
With Zulia we chose to rewrite fieldName:* queries to hiddenField:fieldName
and to automatically add the names of all fields that are present to a hidden
field, which is the alternative Uwe described. It seems to work well.

https://github.com/zuliaio/zuliasearch/blob/master/zulia-query-parser/src/main/java/io/zulia/server/search/ZuliaQueryParser.java#L218
https://github.com/zuliaio/zuliasearch/blob/master/zulia-server/src/main/java/io/zulia/server/index/ShardDocumentIndexer.java#L122

~Matt
