Mailing List Archive

Boolean field type
Hey,

I've been musing about ideas for a "clever" Boolean field type on Lucene
for a while, and I think I might have an idea that could work. That said,
this popped into my head this afternoon and has not been fully-baked. It
may not be very clever at all.

My experience is that Boolean fields tend to be overwhelmingly true or
overwhelmingly false. I've had pretty good luck with using a keyword-style
field, where the only term represents the more sparse value. (For example,
I did a thing years ago with explicit tombstones, where versioned deletes
would have the field "deleted" with a value of "true", and live
documents didn't have the deleted field at all. Every query would add a
filter on "NOT deleted:true".)

That's great when you know up-front what the sparse value is going to be.
Working on OpenSearch, I just created an issue suggesting that we take a
hint from users for which value they think is going to be more common so we
only index the less common one:
https://github.com/opensearch-project/OpenSearch/issues/11143

At the Lucene level, though, we could index a Boolean field type as the
less common term when we flush (by counting the values and figuring out
which is less common). Then, per segment, we can rewrite any query for the
more common value as NOT the less common value.

You can compute upper/lower bounds on the value frequencies cheaply during
a merge, so I think you could usually write the doc IDs for the less common
value directly (without needing to count them first), even when input
segments disagree on which is the more common value.

If your Boolean field is not overwhelmingly lopsided, you might even want
to split segments to be 100% true or 100% false, such that queries against
the Boolean field become match-all or match-none. On a retail website,
maybe you have some toggle for "only show me results with property X" -- if
all your property X products are in one segment or a handful of segments,
you can drop the property X clause from the matching segments and skip the
other segments.

I guess one icky part of this compared to the usual Lucene field model is
that I'm assuming a Boolean field is never missing (or I guess missing
implies "false" by default?). Would that be a deal-breaker?

Thanks,
Froh
Re: Boolean field type [ In reply to ]
Hello Michael.
This optimization "NOT the less common value" assumes that boolean field is
required, but how to enforce this mandatory field constraint in Lucene? I'm
not aware of something like Solr schema or mapping.
If saying foo:true is common, it means that the posting list goes like
dense sequentially increasing numbers 1,2,3,4,5.. May it already be
compressed by codecs like
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html
?

On Thu, Nov 9, 2023 at 3:31?AM Michael Froh <msfroh@gmail.com> wrote:

> Hey,
>
> I've been musing about ideas for a "clever" Boolean field type on Lucene
> for a while, and I think I might have an idea that could work. That said,
> this popped into my head this afternoon and has not been fully-baked. It
> may not be very clever at all.
>
> My experience is that Boolean fields tend to be overwhelmingly true or
> overwhelmingly false. I've had pretty good luck with using a keyword-style
> field, where the only term represents the more sparse value. (For example,
> I did a thing years ago with explicit tombstones, where versioned deletes
> would have the field "deleted" with a value of "true", and live
> documents didn't have the deleted field at all. Every query would add a
> filter on "NOT deleted:true".)
>
> That's great when you know up-front what the sparse value is going to be.
> Working on OpenSearch, I just created an issue suggesting that we take a
> hint from users for which value they think is going to be more common so we
> only index the less common one:
> https://github.com/opensearch-project/OpenSearch/issues/11143
>
> At the Lucene level, though, we could index a Boolean field type as the
> less common term when we flush (by counting the values and figuring out
> which is less common). Then, per segment, we can rewrite any query for the
> more common value as NOT the less common value.
>
> You can compute upper/lower bounds on the value frequencies cheaply during
> a merge, so I think you could usually write the doc IDs for the less common
> value directly (without needing to count them first), even when input
> segments disagree on which is the more common value.
>
> If your Boolean field is not overwhelmingly lopsided, you might even want
> to split segments to be 100% true or 100% false, such that queries against
> the Boolean field become match-all or match-none. On a retail website,
> maybe you have some toggle for "only show me results with property X" -- if
> all your property X products are in one segment or a handful of segments,
> you can drop the property X clause from the matching segments and skip the
> other segments.
>
> I guess one icky part of this compared to the usual Lucene field model is
> that I'm assuming a Boolean field is never missing (or I guess missing
> implies "false" by default?). Would that be a deal-breaker?
>
> Thanks,
> Froh
>


--
Sincerely yours
Mikhail Khludnev
Re: Boolean field type [ In reply to ]
Can you require the user to specify missing: true or missing: false
semantics. With that you can decide what to do with the missing values

On Thu, Nov 9, 2023, 7:55?AM Mikhail Khludnev <mkhl@apache.org> wrote:

> Hello Michael.
> This optimization "NOT the less common value" assumes that boolean field
> is required, but how to enforce this mandatory field constraint in Lucene?
> I'm not aware of something like Solr schema or mapping.
> If saying foo:true is common, it means that the posting list goes like
> dense sequentially increasing numbers 1,2,3,4,5.. May it already be
> compressed by codecs like
> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html
> ?
>
> On Thu, Nov 9, 2023 at 3:31?AM Michael Froh <msfroh@gmail.com> wrote:
>
>> Hey,
>>
>> I've been musing about ideas for a "clever" Boolean field type on Lucene
>> for a while, and I think I might have an idea that could work. That said,
>> this popped into my head this afternoon and has not been fully-baked. It
>> may not be very clever at all.
>>
>> My experience is that Boolean fields tend to be overwhelmingly true or
>> overwhelmingly false. I've had pretty good luck with using a keyword-style
>> field, where the only term represents the more sparse value. (For example,
>> I did a thing years ago with explicit tombstones, where versioned deletes
>> would have the field "deleted" with a value of "true", and live
>> documents didn't have the deleted field at all. Every query would add a
>> filter on "NOT deleted:true".)
>>
>> That's great when you know up-front what the sparse value is going to be.
>> Working on OpenSearch, I just created an issue suggesting that we take a
>> hint from users for which value they think is going to be more common so we
>> only index the less common one:
>> https://github.com/opensearch-project/OpenSearch/issues/11143
>>
>> At the Lucene level, though, we could index a Boolean field type as the
>> less common term when we flush (by counting the values and figuring out
>> which is less common). Then, per segment, we can rewrite any query for the
>> more common value as NOT the less common value.
>>
>> You can compute upper/lower bounds on the value frequencies cheaply
>> during a merge, so I think you could usually write the doc IDs for the less
>> common value directly (without needing to count them first), even when
>> input segments disagree on which is the more common value.
>>
>> If your Boolean field is not overwhelmingly lopsided, you might even want
>> to split segments to be 100% true or 100% false, such that queries against
>> the Boolean field become match-all or match-none. On a retail website,
>> maybe you have some toggle for "only show me results with property X" -- if
>> all your property X products are in one segment or a handful of segments,
>> you can drop the property X clause from the matching segments and skip the
>> other segments.
>>
>> I guess one icky part of this compared to the usual Lucene field model is
>> that I'm assuming a Boolean field is never missing (or I guess missing
>> implies "false" by default?). Would that be a deal-breaker?
>>
>> Thanks,
>> Froh
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
Re: Boolean field type [ In reply to ]
Thanks Mikhail and Mike!

Mikhail, since you replied, I remembered your work on block joins in Solr
(thank you for that, by the way!), which reminded me that it's not unusual
for docs in a Lucene index to "mix" their schemata, like in parent/child
blocks. If 90% of parent docs are "true" on a Boolean field, but the field
doesn't exist for the child docs, my suggested approach would probably see
"true" as the sparse value (assuming there are at least as many children as
parents). Ideally, I would want to only track the "false" parents (and
leave the field off of the children).

Indeed the idea of a "required" field in Lucene is tricky (though Mike's
suggestion of missing defaults could help). Even worse, I think we'd also
need a way to enforce "exactly one value", since a "Boolean" term field can
really have four states -- true, false, neither, or both.

It feels like there's not a workable short-term solution to implement
something like this as a regular IndexableField in Lucene (or at least I'm
not seeing it).

I don't think I see enough value to pitch the idea of adding a new
field-like "thing" (that would have exactly one value for every doc, and
maybe could be counted relative to docs in a block). For now, I think it's
probably only practical to implement something like this as part of a
schema definition in one of the higher-level search servers.

Thanks for the discussion!
Froh

On Thu, Nov 9, 2023 at 5:01?AM Michael Sokolov <msokolov@gmail.com> wrote:

> Can you require the user to specify missing: true or missing: false
> semantics. With that you can decide what to do with the missing values
>
> On Thu, Nov 9, 2023, 7:55?AM Mikhail Khludnev <mkhl@apache.org> wrote:
>
>> Hello Michael.
>> This optimization "NOT the less common value" assumes that boolean field
>> is required, but how to enforce this mandatory field constraint in Lucene?
>> I'm not aware of something like Solr schema or mapping.
>> If saying foo:true is common, it means that the posting list goes like
>> dense sequentially increasing numbers 1,2,3,4,5.. May it already be
>> compressed by codecs like
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html
>> ?
>>
>> On Thu, Nov 9, 2023 at 3:31?AM Michael Froh <msfroh@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> I've been musing about ideas for a "clever" Boolean field type on Lucene
>>> for a while, and I think I might have an idea that could work. That said,
>>> this popped into my head this afternoon and has not been fully-baked. It
>>> may not be very clever at all.
>>>
>>> My experience is that Boolean fields tend to be overwhelmingly true or
>>> overwhelmingly false. I've had pretty good luck with using a keyword-style
>>> field, where the only term represents the more sparse value. (For example,
>>> I did a thing years ago with explicit tombstones, where versioned deletes
>>> would have the field "deleted" with a value of "true", and live
>>> documents didn't have the deleted field at all. Every query would add a
>>> filter on "NOT deleted:true".)
>>>
>>> That's great when you know up-front what the sparse value is going to
>>> be. Working on OpenSearch, I just created an issue suggesting that we take
>>> a hint from users for which value they think is going to be more common so
>>> we only index the less common one:
>>> https://github.com/opensearch-project/OpenSearch/issues/11143
>>>
>>> At the Lucene level, though, we could index a Boolean field type as the
>>> less common term when we flush (by counting the values and figuring out
>>> which is less common). Then, per segment, we can rewrite any query for the
>>> more common value as NOT the less common value.
>>>
>>> You can compute upper/lower bounds on the value frequencies cheaply
>>> during a merge, so I think you could usually write the doc IDs for the less
>>> common value directly (without needing to count them first), even when
>>> input segments disagree on which is the more common value.
>>>
>>> If your Boolean field is not overwhelmingly lopsided, you might even
>>> want to split segments to be 100% true or 100% false, such that queries
>>> against the Boolean field become match-all or match-none. On a retail
>>> website, maybe you have some toggle for "only show me results with property
>>> X" -- if all your property X products are in one segment or a handful of
>>> segments, you can drop the property X clause from the matching segments and
>>> skip the other segments.
>>>
>>> I guess one icky part of this compared to the usual Lucene field model
>>> is that I'm assuming a Boolean field is never missing (or I guess missing
>>> implies "false" by default?). Would that be a deal-breaker?
>>>
>>> Thanks,
>>> Froh
>>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
Re: Boolean field type [ In reply to ]
Hi Michael/Mikhails, yet another Mike here:

If you create a NumericDocValuesField, and it only ever has one value per
doc (0, 1), I think the default Codec will compress it well, though maybe
not as well as your idea. It's a neat idea to notice a "very common
default value" and not store it and just store the sparse non-default
values. I don't think Codec does that today.

For the search-time opto, I thought somewhere in Lucene we do something
like your idea, converting to a negation if it has lower estimated
cardinality than the positive form. It might only be for points? If the
field were stored as postings, you could consult the metadata in the terms
dictionary to know the cardinality of each case, and perhaps that the field
is single valued and fully set (no missing values), at which point your
optimization logic might be able to apply during rewrite maybe?

Mike McCandless

http://blog.mikemccandless.com


On Fri, Nov 10, 2023 at 6:05?PM Michael Froh <msfroh@gmail.com> wrote:

> Thanks Mikhail and Mike!
>
> Mikhail, since you replied, I remembered your work on block joins in Solr
> (thank you for that, by the way!), which reminded me that it's not unusual
> for docs in a Lucene index to "mix" their schemata, like in parent/child
> blocks. If 90% of parent docs are "true" on a Boolean field, but the field
> doesn't exist for the child docs, my suggested approach would probably see
> "true" as the sparse value (assuming there are at least as many children as
> parents). Ideally, I would want to only track the "false" parents (and
> leave the field off of the children).
>
> Indeed the idea of a "required" field in Lucene is tricky (though Mike's
> suggestion of missing defaults could help). Even worse, I think we'd also
> need a way to enforce "exactly one value", since a "Boolean" term field can
> really have four states -- true, false, neither, or both.
>
> It feels like there's not a workable short-term solution to implement
> something like this as a regular IndexableField in Lucene (or at least I'm
> not seeing it).
>
> I don't think I see enough value to pitch the idea of adding a new
> field-like "thing" (that would have exactly one value for every doc, and
> maybe could be counted relative to docs in a block). For now, I think it's
> probably only practical to implement something like this as part of a
> schema definition in one of the higher-level search servers.
>
> Thanks for the discussion!
> Froh
>
> On Thu, Nov 9, 2023 at 5:01?AM Michael Sokolov <msokolov@gmail.com> wrote:
>
>> Can you require the user to specify missing: true or missing: false
>> semantics. With that you can decide what to do with the missing values
>>
>> On Thu, Nov 9, 2023, 7:55?AM Mikhail Khludnev <mkhl@apache.org> wrote:
>>
>>> Hello Michael.
>>> This optimization "NOT the less common value" assumes that boolean field
>>> is required, but how to enforce this mandatory field constraint in Lucene?
>>> I'm not aware of something like Solr schema or mapping.
>>> If saying foo:true is common, it means that the posting list goes like
>>> dense sequentially increasing numbers 1,2,3,4,5.. May it already be
>>> compressed by codecs like
>>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html
>>> ?
>>>
>>> On Thu, Nov 9, 2023 at 3:31?AM Michael Froh <msfroh@gmail.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> I've been musing about ideas for a "clever" Boolean field type on
>>>> Lucene for a while, and I think I might have an idea that could work. That
>>>> said, this popped into my head this afternoon and has not been fully-baked.
>>>> It may not be very clever at all.
>>>>
>>>> My experience is that Boolean fields tend to be overwhelmingly true or
>>>> overwhelmingly false. I've had pretty good luck with using a keyword-style
>>>> field, where the only term represents the more sparse value. (For example,
>>>> I did a thing years ago with explicit tombstones, where versioned deletes
>>>> would have the field "deleted" with a value of "true", and live
>>>> documents didn't have the deleted field at all. Every query would add a
>>>> filter on "NOT deleted:true".)
>>>>
>>>> That's great when you know up-front what the sparse value is going to
>>>> be. Working on OpenSearch, I just created an issue suggesting that we take
>>>> a hint from users for which value they think is going to be more common so
>>>> we only index the less common one:
>>>> https://github.com/opensearch-project/OpenSearch/issues/11143
>>>>
>>>> At the Lucene level, though, we could index a Boolean field type as the
>>>> less common term when we flush (by counting the values and figuring out
>>>> which is less common). Then, per segment, we can rewrite any query for the
>>>> more common value as NOT the less common value.
>>>>
>>>> You can compute upper/lower bounds on the value frequencies cheaply
>>>> during a merge, so I think you could usually write the doc IDs for the less
>>>> common value directly (without needing to count them first), even when
>>>> input segments disagree on which is the more common value.
>>>>
>>>> If your Boolean field is not overwhelmingly lopsided, you might even
>>>> want to split segments to be 100% true or 100% false, such that queries
>>>> against the Boolean field become match-all or match-none. On a retail
>>>> website, maybe you have some toggle for "only show me results with property
>>>> X" -- if all your property X products are in one segment or a handful of
>>>> segments, you can drop the property X clause from the matching segments and
>>>> skip the other segments.
>>>>
>>>> I guess one icky part of this compared to the usual Lucene field model
>>>> is that I'm assuming a Boolean field is never missing (or I guess missing
>>>> implies "false" by default?). Would that be a deal-breaker?
>>>>
>>>> Thanks,
>>>> Froh
>>>>
>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>>
>>