Mailing List Archive

Adding a new PointDocValuesField
Hi,

Some background: I've been working on this PR to add hyper
rectangle faceting <https://github.com/apache/lucene/pull/841> capabilities
to Lucene facets and I needed to create a new doc values field to support
this feature. Initially, I had a field that just extended BinaryDocValues,
but then a discussion came up about whether to add a completely new
DocValues field, maybe something like PointDocValuesField (and
SortedPointDocValuesField as the multivalued version) to add first class
support for this new field. Here is the link to the discussion
<https://github.com/apache/lucene/pull/841#discussion_r879869751>. I think
there are a few benefits to this:

- Formalize how we would store points as doc values rather than just
packing points into a BinaryDocValues field in a format that could change
at any time
- NumericDocValues enables us to create a SortedNumericDocValuesRange
query which can be used with IndexOrDocValuesQuery to make some range
queries more efficient. Adding this new doc values field would let us do
the same thing with higher dimensional ranges

I'm sure I could be missing some benefits, and I also am not super
experienced with Lucene so there could be drawbacks I am missing as well
:). From what I understand though, Lucene doesn't have a lot of DocValues
fields and there should be some thought put into adding new ones, so I was
wondering if I could get some feedback about the idea. Thanks!
Re: Adding a new PointDocValuesField
Hi Marc
Thank you for starting the discussion, I think all your points make sense,
but I'm wondering if we really need everything packed into one field? And
what are the advantages of doing that? I *think* most of the facet related
use cases can be satisfied using multiple fields, one field per dimension.
And for use cases outside of the facet world, I'm not sure how
multiple dimensions packed into one field would be more useful than each
dimension having its own field.

Best
Patrick

On Mon, May 23, 2022 at 7:17 PM Marc D'Mello <marcd2000@gmail.com> wrote:

> Hi,
>
> Some background: I've been working on this PR to add hyper
> rectangle faceting <https://github.com/apache/lucene/pull/841>
> capabilities to Lucene facets and I needed to create a new doc values field
> to support this feature. Initially, I had a field that just extended
> BinaryDocValues, but then a discussion came up about whether to add a
> completely new DocValues field, maybe something like PointDocValuesField
> (and SortedPointDocValuesField as the multivalued version) to add first
> class support for this new field. Here is the link to the discussion
> <https://github.com/apache/lucene/pull/841#discussion_r879869751>. I
> think there are a few benefits to this:
>
> - Formalize how we would store points as doc values rather than just
> packing points into a BinaryDocValues field in a format that could change
> at any time
> - NumericDocValues enables us to create a SortedNumericDocValuesRange
> query which can be used with IndexOrDocValuesQuery to make some range
> queries more efficient. Adding this new doc values field would let us do
> the same thing with higher dimensional ranges
>
> I'm sure I could be missing some benefits, and I also am not super
> experienced with Lucene so there could be drawbacks I am missing as well
> :). From what I understand though, Lucene doesn't have a lot of DocValues
> fields and there should be some thought put into adding new ones, so I was
> wondering if I could get some feedback about the idea. Thanks!
>
Re: Adding a new PointDocValuesField
This seems like a really exotic feature to add a dedicated docvalues field for.

We should let BINARY be the catchall for stuff like this.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Adding a new PointDocValuesField
Hi,

Thanks for the responses! For Patrick's question, right now in faceting we
don't have any good way to AND between two fields. I think the original
hyper rectangle issue has a good example of a use case:
https://issues.apache.org/jira/browse/LUCENE-10274.

As for Robert's point, this feature would also allow us to use
MultiRangeQuery in IndexOrDocValuesQuery, but MultiRangeQuery is itself in
the sandbox module so I'm assuming that's a pretty exotic use case as well.
I personally have no issues using BinaryDocValues for this, I was just
wondering if it would be better to create a dedicated doc values type, but it
seems that is not the case.

Thanks,
Marc

On Tue, May 24, 2022 at 1:27 AM Robert Muir <rcmuir@gmail.com> wrote:

> This seems like a really exotic feature to add a dedicated docvalues field for.
>
> We should let BINARY be the catchall for stuff like this.
Re: Adding a new PointDocValuesField
As pointed out by Rob in the issue:

> I would also suggest to start with the simple
> separate-numeric-docvalues-fields case and use similar logic as the
> org.apache.lucene.facet.range package, just on 2-D, or maybe 3-D, N-D, etc


I think that's a preferable solution to me, because:
1. It does not couple the dimensions together so that people can combine
them freely
2. It might be able to be compressed better

Best

On Tue, May 24, 2022 at 9:08 AM Marc D'Mello <marcd2000@gmail.com> wrote:

> Hi,
>
> Thanks for the responses! For Patrick's question, right now in faceting we
> don't have any good way to AND between two fields. I think the original
> hyper rectangle issue has a good example of a use case:
> https://issues.apache.org/jira/browse/LUCENE-10274.
>
> As for Robert's point, this feature would also allow us to use
> MultiRangeQuery in IndexOrDocValuesQuery, but MultiRangeQuery is itself in
> the sandbox module so I'm assuming that's a pretty exotic use case as well.
> I personally have no issues using BinaryDocValues for this, I was just
> wondering if it would be better to create a dedicated doc values type, but it
> seems that is not the case.
>
> Thanks,
> Marc
Re: Adding a new PointDocValuesField
Thanks for the comments Patrick, but I'm not sure I'm fully
understanding the suggestion here. I don't see a path forward that
uses different fields, but maybe I'm missing something. Imagine you're
running an ecommerce site selling automotive parts and you need to
index fitment information that consists of the year + make of vehicles
a part fits. Imagine a set of wiper blades fit 2010 Ford vehicles and
2011 Chevy vehicles (but _not_ 2011 Ford or 2010 Chevy). And let's say
we want to facet on products that fit a 2011 Ford. We need to make
sure this product does _not_ count. We can achieve this with points in
two dimensions (year + make), but not as two separate fields (at least
as far as I can come up with). A "two separate field approach" would
consist of indexing year and make separately, and you'd lose the
information that only certain combinations are valid. Am I overlooking
something with your suggestion? Maybe there's something we can do with
Lucene already that solves for this case and I'm just not aware of it?
That's entirely possible and I'd love to learn more if there is!
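
To make the fitment example concrete, here is a minimal sketch in plain
Java. It is not Lucene code: the class, the method names, and the integer
ids for makes are all made up for illustration. It only shows why matching
each dimension independently reports a false positive for a 2011 Ford,
while matching whole (year, make) points does not.

```java
/** Hypothetical illustration only; none of these names exist in Lucene. */
final class FitmentDemo {
  // Assumed integer ids for vehicle makes (dimension 1 of each point).
  static final int FORD = 0;
  static final int CHEVY = 1;

  /** Separate-fields view: year and make are matched independently of each other. */
  static boolean matchesSeparateFields(int[][] fitments, int year, int make) {
    boolean yearSeen = false;
    boolean makeSeen = false;
    for (int[] f : fitments) {
      yearSeen |= f[0] == year;
      makeSeen |= f[1] == make;
    }
    return yearSeen && makeSeen;
  }

  /** Point view: year and make must match within the same (year, make) point. */
  static boolean matchesAsPoints(int[][] fitments, int year, int make) {
    for (int[] f : fitments) {
      if (f[0] == year && f[1] == make) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Wiper blades that fit 2010 Ford and 2011 Chevy vehicles.
    int[][] wiperBlades = {{2010, FORD}, {2011, CHEVY}};
    System.out.println(matchesSeparateFields(wiperBlades, 2011, FORD)); // true (false positive)
    System.out.println(matchesAsPoints(wiperBlades, 2011, FORD)); // false
  }
}
```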

As for MultiRangeQuery and the mention of sandbox modules, I think
that's a bit of a different use-case. MultiRangeQuery lets you filter
by a disjunction of ranges. The "multi" part doesn't relate to
"multiple values in a doc" (but it does support that, as do the
"standard" range queries).

Where I see a gap right now, beyond just faceting, is that we can
represent N-dim points in the points index and filter on them (using
the points index), but we have no doc values equivalent. This means,
1) we can't facet, and 2) we can't create a "slow" query that does
post-filtering instead of using the points index (which could be a
very real advantage in cases with a sparse match set but a dense
points index). So I like the idea of creating that concept and being
able to facet and filter on it. Whether-or-not this is a "formal" doc
values type or sits on top of BDV, I have less of a strong opinion.
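
As a rough sketch of the post-filtering idea (plain Java, not a Lucene
Query): given a small set of candidate docs from some other clause, look up
each doc's N-dim point and check it against the range. The per-doc lookup
function below is a stand-in for a doc values read, and every name here is
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical illustration only; this is not how a Lucene Query is implemented. */
final class SlowRangeFilter {
  /** True if every coordinate of the point lies within [min[d], max[d]]. */
  static boolean contains(long[] min, long[] max, long[] point) {
    for (int d = 0; d < point.length; d++) {
      if (point[d] < min[d] || point[d] > max[d]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Visits only the candidate docs and keeps those whose point falls in the range.
   * pointForDoc stands in for a per-document doc values lookup and may return null
   * for a doc without a value.
   */
  static List<Integer> postFilter(
      int[] candidateDocs, IntFunction<long[]> pointForDoc, long[] min, long[] max) {
    List<Integer> hits = new ArrayList<>();
    for (int doc : candidateDocs) {
      long[] point = pointForDoc.apply(doc);
      if (point != null && contains(min, max, point)) {
        hits.add(doc);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    long[][] pointsByDoc = {{1, 1}, {5, 5}, {9, 2}, {20, 5}};
    int[] candidates = {1, 3}; // sparse match set from another clause
    System.out.println(
        postFilter(candidates, doc -> pointsByDoc[doc], new long[] {0, 0}, new long[] {10, 10}));
    // prints [1]
  }
}
```

The cost here is proportional to the number of candidates rather than to
the size of the points index, which is the advantage described above for a
sparse match set.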

And finally... it really should be multi-valued. The points index
supports multiple points-per-field within a single document. Seems
like a big gap that we wouldn't support that with a doc value field.
Because BDV is inherently single-valued, I propose we come up with an
encoding scheme that encodes multiple points on top of that "single"
BDV entry. This is where building on BDV started to feel a little icky
to me and it seemed like it might be a good use-case for actually
formalizing a format/encoding, but again, no strong preference. We
could certainly do something more quickly on top of BDV and formalize
an encoding later if/as necessary.
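
For illustration, one possible encoding could look like the sketch below.
The layout (a dimension count, a point count, then fixed-width big-endian
longs) is purely an assumption for this example, not an agreed format, and
the class is hypothetical. The resulting byte[] is what would be stored as
the single BDV value for a document.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical encoding sketch; the layout is an assumption, not a Lucene format. */
final class PackedPoints {
  /** Assumed layout: int numDims, int numPoints, then numPoints * numDims big-endian longs. */
  static byte[] encode(long[][] points, int numDims) {
    ByteBuffer buf =
        ByteBuffer.allocate(2 * Integer.BYTES + points.length * numDims * Long.BYTES);
    buf.putInt(numDims).putInt(points.length);
    for (long[] point : points) {
      if (point.length != numDims) {
        throw new IllegalArgumentException("expected " + numDims + " dims");
      }
      for (long value : point) {
        buf.putLong(value);
      }
    }
    return buf.array();
  }

  /** Reverses encode(): reads the two header ints, then each point's coordinates. */
  static long[][] decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int numDims = buf.getInt();
    int numPoints = buf.getInt();
    long[][] points = new long[numPoints][numDims];
    for (long[] point : points) {
      for (int d = 0; d < numDims; d++) {
        point[d] = buf.getLong();
      }
    }
    return points;
  }

  public static void main(String[] args) {
    byte[] bytes = encode(new long[][] {{2010, 0}, {2011, 1}}, 2);
    System.out.println(bytes.length); // 40
    System.out.println(Arrays.deepToString(decode(bytes))); // [[2010, 0], [2011, 1]]
  }
}
```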

Thanks again for the discussion so far Marc, Patrick and Rob!

Cheers,
-Greg

On Tue, May 24, 2022 at 10:35 AM Patrick Zhai <zhai7631@gmail.com> wrote:
>
> As pointed out by Rob in the issue
>
>> I would also suggest to start with the simple separate-numeric-docvalues-fields case and use similar logic as the org.apache.lucene.facet.range package, just on 2-D, or maybe 3-D, N-D, etc
>
>
> I think that's a preferable solution to me, because:
> 1. It does not couple the dimensions together so that people can combine them freely
> 2. It might be able to be compressed better
>
> Best
Re: Adding a new PointDocValuesField
Hi Greg, thanks for the explanation! The example makes perfect sense to me,
I was under the impression that this was combining two independent fields
and I was wrong.

I have no strong preference for or against a new field, but for
multi-value, don't we have a SortedSetDocValuesField that works as a
multi-value version of BDV?

Best
Patrick

On Tue, May 24, 2022 at 9:17 PM Greg Miller <gsmiller@gmail.com> wrote:

> Thanks for the comments Patrick, but I'm not sure I'm fully
> understanding the suggestion here. I don't see a path forward that
> uses different fields, but maybe I'm missing something. Imagine you're
> running an ecommerce site selling automotive parts and you need to
> index fitment information that consists of the year + make of vehicles
> a part fits. Imagine a set of wiper blades fit 2010 Ford vehicles and
> 2011 Chevy vehicles (but _not_ 2011 Ford or 2010 Chevy). And let's say
> we want to facet on products that fit a 2011 Ford. We need to make
> sure this product does _not_ count. We can achieve this with points in
> two dimensions (year + make), but not as two separate fields (at least
> as far as I can come up with). A "two separate field approach" would
> consist of indexing year and make separately, and you'd lose the
> information that only certain combinations are valid. Am I overlooking
> something with your suggestion? Maybe there's something we can do with
> Lucene already that solves for this case and I'm just not aware of it?
> That's entirely possible and I'd love to learn more if there is!
>
> As for MultiRangeQuery and the mention of sandbox modules, I think
> that's a bit of a different use-case. MultiRangeQuery lets you filter
> by a disjunction of ranges. The "multi" part doesn't relate to
> "multiple values in a doc" (but it does support that, as do the
> "standard" range queries).
>
> Where I see a gap right now, beyond just faceting, is that we can
> represent N-dim points in the points index and filter on them (using
> the points index), but we have no doc values equivalent. This means,
> 1) we can't facet, and 2) we can't create a "slow" query that does
> post-filtering instead of using the points index (which could be a
> very real advantage in cases with a sparse match set but a dense
> points index). So I like the idea of creating that concept and being
> able to facet and filter on it. Whether-or-not this is a "formal" doc
> values type or sits on top of BDV, I have less of a strong opinion.
>
> And finally... it really should be multi-valued. The points index
> supports multiple points-per-field within a single document. Seems
> like a big gap that we wouldn't support that with a doc value field.
> Because BDV is inherently single-valued, I propose we come up with an
> encoding scheme that encodes multiple points on top of that "single"
> BDV entry. This is where building on BDV started to feel a little icky
> to me and it seemed like it might be a good use-case for actually
> formalizing a format/encoding, but again, no strong preference. We
> could certainly do something more quickly on top of BDV and formalize
> an encoding later if/as necessary.
>
> Thanks again for the discussion so far Marc, Patrick and Rob!
>
> Cheers,
> -Greg
Re: Adding a new PointDocValuesField
Also, there should be examples from other fields. Suppose you are
indexing map data and want to support a UI that shows "hot spots" on
the map where there is a lot of let's say ... activity of some sort.
You'd like to facet on 2-d areas.

Or for log analytics -- you want to do anomaly detection and find
regions of time and some other dimension (API endpoint, host,
whatever) that have a lot of -- events of interest. Probably could
benefit from multi-dimensional faceting?

On Wed, May 25, 2022 at 2:07 AM Patrick Zhai <zhai7631@gmail.com> wrote:
>
> Hi Greg, thanks for the explanation! The example makes perfect sense to me, I was under the impression that this was combining two independent fields and I was wrong.
>
> I have no strong preference for or against a new field, but for multi-value, don't we have a SortedSetDocValuesField that works as a multi-value version of BDV?
>
> Best
> Patrick
Re: Adding a new PointDocValuesField
On Wed, May 25, 2022 at 8:04 AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> Also, there should be examples from other fields. Suppose you are
> indexing map data and want to support a UI that shows "hot spots" on
> the map where there is a lot of let's say ... activity of some sort.
> You'd like to facet on 2-d areas.

then use LatLonDocValuesField

Re: Adding a new PointDocValuesField [ In reply to ]
> then use LatLonDocValuesField

Right! Actually, LatLonDocValuesField is a good example of what we're
trying to do here, but specialized to the 2D, lat/long case. It stores
a doc value representation of a lat/long point that can be used for
"slow" queries, which complement the points-based queries (e.g.,
LLDVF#newSlowBoxQuery, LLDVF#newSlowDistanceQuery, etc.). It could
also support faceting (although I don't think an implementation
exists?). And, it's multi-valued (which it achieves by packing a
lat/long tuple into a single long value and then encoding with
SORTED_NUMERIC). I think this is actually a great example we could
follow here, and supports the idea of _not_ adding a specific DV type,
but rather building on top of BDV.
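
To make the packing trick concrete, here is a minimal sketch in plain Java of fitting a two-component point into one long. The class and method names are made up for illustration, and the inputs are assumed to be already quantized to 32-bit ints; this is not the actual LatLonDocValuesField code.

```java
public final class PackedPoint2D {
    private PackedPoint2D() {}

    /** Packs two 32-bit values into one long: first in the high bits, second in the low bits. */
    public static long pack(int first, int second) {
        // Mask the low half so a negative 'second' does not sign-extend into the high bits.
        return (((long) first) << 32) | (second & 0xFFFFFFFFL);
    }

    /** Recovers the high 32 bits. */
    public static int first(long packed) {
        return (int) (packed >>> 32);
    }

    /** Recovers the low 32 bits. */
    public static int second(long packed) {
        return (int) packed;
    }

    public static void main(String[] args) {
        long packed = pack(-12345, 67890);
        System.out.println(first(packed) + ", " + second(packed));
    }
}
```

Because each point becomes one long, several points per document can then sit in a multi-valued numeric field. The trick only works while all dimensions fit in 64 bits.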

In our use-case, we'd like to generalize to N-dims and not make
assumptions about lat/long data. Because of that, I don't see a way to
pack our dims into a single long value and build on SORTED_NUMERIC, so
I think we need to have a different encoding scheme on top of BDV.
Patrick, maybe this is what you were getting at in your last comment,
but please let me know if I'm mis-interpreting.
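
For illustration only, one possible encoding of several N-dim points into a single byte array might look like the sketch below. The layout (a 4-byte dimension count followed by fixed-width 8-byte values) and the class name are assumptions made up for this example; it is not the format used in the PR.

```java
import java.nio.ByteBuffer;

public final class PointBytesCodec {
    private PointBytesCodec() {}

    /** Encodes points as: [numDims:int][point0 dim0:long]...[pointK dimN-1:long]. */
    public static byte[] encode(long[][] points, int numDims) {
        if (numDims < 1) {
            throw new IllegalArgumentException("numDims must be >= 1");
        }
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + points.length * numDims * Long.BYTES);
        buf.putInt(numDims);
        for (long[] point : points) {
            if (point.length != numDims) {
                throw new IllegalArgumentException("every point must have " + numDims + " dims");
            }
            for (long value : point) {
                buf.putLong(value);
            }
        }
        return buf.array();
    }

    /** Decodes the layout written by encode(); the point count is derived from the length. */
    public static long[][] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int numDims = buf.getInt();
        int numPoints = buf.remaining() / (numDims * Long.BYTES);
        long[][] points = new long[numPoints][numDims];
        for (long[] point : points) {
            for (int d = 0; d < numDims; d++) {
                point[d] = buf.getLong();
            }
        }
        return points;
    }
}
```

The resulting bytes could be wrapped in a BytesRef and stored in a BinaryDocValuesField, which keeps all of a document's points together in the one binary value per document that BDV allows.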

So, circling back to Marc's original question, I would suggest we
_not_ introduce a new doc values type (at least at this time), and
build on BDV.

Cheers,
-g

Re: Adding a new PointDocValuesField [ In reply to ]
On Wed, May 25, 2022 at 12:17 AM Greg Miller <gsmiller@gmail.com> wrote:
>
> A "two separate field approach" would
> consist of indexing year and make separately, and you'd lose the
> information that only certain combinations are valid. Am I overlooking
> something with your suggestion? Maybe there's something we can do with
> Lucene already that solves for this case and I'm just not aware of it?
> That's entirely possible and I'd love to learn more if there is!

This makes no sense to me. If there are two dimensions, there's no
difference in faceting code calling fieldA.value and fieldB.value,
than calling field.valueA and field.valueB.

In other words, doesn't make any sense to needlessly "pack dimensions
together" at docvalues level, especially for what should be a
column-stride field. There's really no difference from the app
perspective. Any issues you have here seem to be issues around facet
module and not docvalues...

>
> As for MultiRangeQuery and the mention of sandbox modules, I think
> that's a bit of a different use-case. MultiRangeQuery lets you filter
> by a disjunction of ranges. The "multi" part doesn't relate to
> "multiple values in a doc" (but it does support that, as do the
> "standard" range queries).
>
> Where I see a gap right now, beyond just faceting, is that we can
> represent N-dim points in the points index and filter on them (using
> the points index), but we have no doc values equivalent. This means,
> 1) we can't facet, and 2) we can't create a "slow" query that does
> post-filtering instead of using the points index (which could be a
> very real advantage in cases with a sparse match set but a dense
> points index). So I like the idea of creating that concept and being
> able to facet and filter on it. Whether-or-not this is a "formal" doc
> values type or sits on top of BDV, I have less of a strong opinion.

We shouldn't add new docvalues types because of "slow queries", I'm
really against that. The root problem is that points impl can't filter
well (like the inverted index can), and as a hack, docvalues "picks up
the slack". If it's becoming a major issue, address this with points
directly?

>
> And finally... it really should be multi-valued. The points index
> supports multiple points-per-field within a single document. Seems
> like a big gap that we wouldn't support that with a doc value field.
> Because BDV is inherently single-valued, I propose we come up with an
> encoding scheme that encodes multiple points on top of that "single"
> BDV entry. This is where building on BDV started to feel a little icky
> to me and it seemed like it might be a good use-case for actually
> formalizing a format/encoding, but again, no strong preference. We
> could certainly do something more quickly on top of BDV and formalize
> an encoding later if/as necessary.

Doesn't matter that points index supports it. Do the use-cases make
sense? It's especially stupid that e.g. LatLonDocValuesField supports
multi-values. Really? What kind of quantum documents are in multiple
locations at the same time?

The sortedset/sortednumeric exist to support use-cases on String and
int, where user wants to "sort on a multivalued field", which is
really crazy if you think about it. So they both sort the numbers at
index-time, so that you can pick a "representative" value
(min/max/median) in constant time. I think a lot of this existing
stuff is just brain-damage from the no-sql fads, alternatively we
could remove this multivalued nonsense and the crazy servers that want
to follow no-sql fads could index just the "representative value"
(min/max/median) in a single-valued field.
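
A small sketch of why sorting at index time makes the "representative" value cheap to pick: once a document's values are stored in sorted order, min, max, and median are plain array lookups. The class below is a hypothetical plain-Java illustration, not Lucene's selector code, and it takes the lower middle element as the median for even counts.

```java
import java.util.Arrays;

public final class RepresentativeValue {
    private RepresentativeValue() {}

    /** Simulates what happens at index time: the document's values are stored sorted. */
    public static long[] indexTimeSort(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    /** Smallest value: first slot of the sorted array. */
    public static long min(long[] sorted) {
        return sorted[0];
    }

    /** Largest value: last slot of the sorted array. */
    public static long max(long[] sorted) {
        return sorted[sorted.length - 1];
    }

    /** Middle value (lower middle when the count is even). */
    public static long median(long[] sorted) {
        return sorted[(sorted.length - 1) / 2];
    }

    public static void main(String[] args) {
        long[] sorted = indexTimeSort(new long[] {7, 3, 9, 1});
        System.out.println(min(sorted) + " " + median(sorted) + " " + max(sorted));
    }
}
```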

Sorry, I'm just not seeing a lot of strong use-cases here to justify
creating a new DV field, which we should really avoid, as it's a hugely
expensive cost. I would recommend prototyping stuff with
BinaryDocValues, using the sandbox, etc. See if the features get
popular and people use them.

If they really "catch on", and we think it's more efficient, then we
can think about how the stuff could be best encoded/compressed/etc.
But adding a new type should be the last resort. Adding some
specialized multi-dimensional type is IMO out of the question. It
would be a lot less horrible to just use separate DV fields, one for
each dimension. If there are *strong* compelling use-cases for
multi-valued stuff, then in the worst case we could think about
something like an UnsortedNumericDV, which would allow fieldA[0] to
align with fieldB[0] and fieldA[1] to align with fieldB[1], which
would solve the issue for faceting. Just don't allow sorting. And
probably not any "slow" query stuff too.
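
As a rough sketch of the aligned-indices idea, the plain-Java example below treats two parallel arrays as two per-document columns and pairs them up by position. The class, the constants, and the year/make data are hypothetical. It also shows why the columns would have to stay unsorted: sorting each column independently, as a sorted multi-valued field would, pairs up the wrong values.

```java
import java.util.Arrays;

public final class AlignedColumns {
    private AlignedColumns() {}

    // Hypothetical numeric ids for the "make" dimension.
    public static final long CHEVY = 0;
    public static final long FORD = 1;

    /** True if some position i holds exactly this (year, make) pair. */
    public static boolean matches(long[] yearCol, long[] makeCol, long year, long make) {
        for (int i = 0; i < yearCol.length; i++) {
            if (yearCol[i] == year && makeCol[i] == make) {
                return true;
            }
        }
        return false;
    }

    /** Returns a sorted copy, mimicking a column that is sorted at index time. */
    public static long[] sortedCopy(long[] column) {
        long[] copy = column.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        long[] years = {2010, 2011};
        long[] makes = {FORD, CHEVY}; // (2010, Ford) and (2011, Chevy)

        // Aligned columns keep the valid combinations.
        System.out.println(matches(years, makes, 2010, CHEVY)); // false

        // Independently sorted columns no longer line up.
        System.out.println(matches(sortedCopy(years), sortedCopy(makes), 2010, CHEVY)); // true
    }
}
```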

Re: Adding a new PointDocValuesField [ In reply to ]
>
> But adding a new type should be the last resort.


I did not realize that was the case, that's good to know. It seems like I
should just use BDV (which does make the code change easier/faster so I
have no issues with it).

As for Patrick's suggestion of using separate numeric fields instead of
packing them together, that actually does sound like an interesting idea.
I think the biggest issue with it, though, would be implementing a multivalued
version of this. As Robert pointed out, we would need an UnsortedNumericDV.

Thanks for all the feedback!


Re: Adding a new PointDocValuesField [ In reply to ]
I appreciate all the feedback, but disagree that we can accomplish what
we’re trying to do here with the existing fields.

It’s not sufficient to AND together multiple fields for this use-case
because the different dimensions can be multi-valued and not all
combinations are valid. To go back to my example, imagine wiper
blades that fit 2010 Ford vehicles and 2011 Chevy vehicles but not 2010
Chevy or 2011 Ford. You have to index the combinations, not the separate
component values. I can’t see a way to retain this information with
separate fields. Am I missing something? I guess with an “unsorted” numeric
DV type we could get there with aligned indices, as you describe, but that
seems less appealing than supporting multi-dim points directly.
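
The wiper blade example can be sketched in a few lines of plain Java. Everything here (class name, the string key format, the sets standing in for indexed fields) is a hypothetical illustration rather than Lucene code: separate per-dimension fields accept a combination that the document never contained, while indexing the combinations does not.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public final class ComboVsSeparate {
    private ComboVsSeparate() {}

    // One document: wiper blades that fit 2010 Fords and 2011 Chevys.
    // Separate fields store each dimension's values on their own...
    public static final Set<Integer> YEARS = new HashSet<>(Arrays.asList(2010, 2011));
    public static final Set<String> MAKES = new HashSet<>(Arrays.asList("Ford", "Chevy"));
    // ...while a combined field stores the valid (year, make) pairs.
    public static final Set<String> COMBOS =
        new HashSet<>(Arrays.asList("2010|Ford", "2011|Chevy"));

    /** ANDs the two separate fields together. */
    public static boolean matchesSeparate(int year, String make) {
        return YEARS.contains(year) && MAKES.contains(make);
    }

    /** Checks the indexed combination directly. */
    public static boolean matchesCombined(int year, String make) {
        return COMBOS.contains(year + "|" + make);
    }

    public static void main(String[] args) {
        System.out.println(matchesSeparate(2010, "Chevy")); // true: a false positive
        System.out.println(matchesCombined(2010, "Chevy")); // false: correct
    }
}
```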

I’m in agreement though that there isn’t a compelling need to add a new
field type for this. I have no problem building on BDV and putting this in
the sandbox module to start. Makes sense to me. It sounds like we’d have
consensus to take that approach and re-evaluate if there are future needs?
Any objections?

Cheers,
-g


Re: Adding a new PointDocValuesField [ In reply to ]
Read your example again and yes, that makes sense. I was only thinking in
terms of single dimensions, my bad!

Re: Adding a new PointDocValuesField [ In reply to ]
On Wed, May 25, 2022 at 2:08 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>
> I guess with an “unsorted” numeric DV type we could get there with aligned indices, as you describe, but that seems less appealing than supporting multi-dim points directly.
>

Name one technical reason why?
Unsorted would be exactly just as good, except also more general
purpose. The number of docvalues types should be kept to a strict
minimum, and should be generally useful to a variety of common
use-cases. Each type has a huge maintenance cost, and never goes away.
Every codec must implement every type.

Re: Adding a new PointDocValuesField [ In reply to ]
I agree that technically it's just as good. I also think it's less
clear for a user. The concept of "points" is something we've
established in Lucene, so I think it makes sense for users to think
about indexing points as a doc value as opposed to having to manage
multiple fields for all their dimensions in this sort of unsorted
field. But that's just my opinion as a user. But that's maybe a bit
philosophical at this point and I think we can "agree to disagree" for
now because...

... just to be clear, I'm _not_ suggesting we add a new doc value type
at this time. I'm not even necessarily advocating that we ever add it.
I think it's perfectly reasonable to define a new Field class that
builds on top of BDV (as Marc has done in his PR) that allows users to
add "point" fields to their documents that get indexed as doc values
(using BDV). This is very similar to LatLonDocValuesField,
LongRangeDocValuesField, etc. Is that an acceptable approach to you,
or are you advocating that we shouldn't do that and should instead
create these new "unsorted" numeric fields now? I'm even fine if we
put this in the sandbox module for now while we "kick the tires." In
fact, I think I'd advocate for that.

Thanks again for the feedback. It forced a deep examination of this
idea, which I appreciate.

Cheers,
-g

Re: Adding a new PointDocValuesField [ In reply to ]
On Thu, May 26, 2022 at 11:49 AM Greg Miller <gsmiller@gmail.com> wrote:
>
> I agree that technically it's just as good. I also think it's less
> clear for a user. The concept of "points" is something we've
> established in Lucene, so I think it makes sense for users to think
> about indexing points as a doc value as opposed to having to manage
> multiple fields for all their dimensions in this sort of unsorted
> field. But that's just my opinion as a user. But that's maybe a bit
> philosophical at this point and I think we can "agree to disagree" for
> now because...

Users don't deal with low level docvalues codec APIs, so I see this
"as a user" as irrelevant, sorry. Higher-level classes (e.g. Field
class) could impl it this way as implementation detail.

>
> ... just to be clear, I'm _not_ suggesting we add a new doc value type
> at this time. I'm not even necessarily advocating that we ever add it.
> I think it's perfectly reasonable to define a new Field class that
> builds on top of BDV (as Marc has done in his PR) that allows users to
> add "point" fields to their documents that get indexed as doc values
> (using BDV). This is very similar to LatLonDocValuesField,
> LongRangeDocValuesField, etc. Is that an acceptable approach to you,
> or are you advocating that we shouldn't do that and should instead
> create these new "unsorted" numeric fields now? I'm even fine if we
> put this in the sandbox module for now while we "kick the tires." In
> fact, I think I'd advocate for that.

+1 to build a field class in sandbox, using BDV behind the scenes. I
don't want to add any new DV types, trust me. I am just especially
opinionated against multidimensional stuff pushed down to docvalues
level, when it makes no sense from a DV perspective (column stride
fields). If you have 3 dimensions of numbers, at a low level it would
just make 3 columns at the end of the day anyway: IMO it would only
make codec code more complicated with no benefit. So that's why I was
listing out other alternatives.
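
To make the "field class using BDV behind the scenes" idea concrete, here is a minimal sketch of how such a class might pack a multi-dimensional point into a single binary value. The class name `PackedPoint`, its methods, and the sign-flipped big-endian encoding are all assumptions made for illustration; this is not necessarily the format used in the PR. In Lucene the resulting `byte[]` would presumably be wrapped in a `BytesRef` and passed to `BinaryDocValuesField`, which is omitted here so the sketch runs with the JDK alone.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper (not a Lucene class): packs an N-dimensional long point into one byte[]. */
final class PackedPoint {
  private PackedPoint() {}

  /**
   * Encodes each dimension as 8 big-endian bytes with the sign bit flipped,
   * so that unsigned byte-wise comparison agrees with numeric order per dimension.
   */
  static byte[] pack(long... dims) {
    ByteBuffer buf = ByteBuffer.allocate(dims.length * Long.BYTES); // big-endian by default
    for (long d : dims) {
      buf.putLong(d ^ Long.MIN_VALUE); // flip the sign bit
    }
    return buf.array();
  }

  /** Reverses pack(): reads 8 bytes per dimension and flips the sign bit back. */
  static long[] unpack(byte[] packed) {
    if (packed.length % Long.BYTES != 0) {
      throw new IllegalArgumentException("length must be a multiple of " + Long.BYTES);
    }
    ByteBuffer buf = ByteBuffer.wrap(packed);
    long[] dims = new long[packed.length / Long.BYTES];
    for (int i = 0; i < dims.length; i++) {
      dims[i] = buf.getLong() ^ Long.MIN_VALUE;
    }
    return dims;
  }

  public static void main(String[] args) {
    byte[] packed = pack(3L, -7L, 42L);               // one value holding 3 dimensions
    System.out.println(packed.length);                  // prints 24
    System.out.println(Arrays.toString(unpack(packed))); // prints [3, -7, 42]
  }
}
```

The codec only ever sees one opaque binary column here; the dimensionality lives entirely in the field class, which is the separation argued for above.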

Re: Adding a new PointDocValuesField [ In reply to ]
> Users don't deal with low level docvalues codec APIs, so I see this
> "as a user" as irrelevant, sorry. Higher-level classes (e.g. Field
> class) could impl it this way as implementation detail.

Hmm, that's a different perspective than I had, but I understand where
you're coming from and I think I agree. I think I'm so used to
directly interacting with doc values that I haven't considered this
point-of-view (that users shouldn't commonly be interacting with
DVs directly). As long as we provide a higher-level Field class that
abstracts the implementation details, I think I'm on the same page
with you here.

> +1 to build a field class in sandbox, using BDV behind the scenes. I
> don't want to add any new DV types, trust me. I am just especially
> opinionated against multidimensional stuff pushed down to docvalues
> level, when it makes no sense from a DV perspective (column stride
> fields). If you have 3 dimensions of numbers, at a low level it would
> just make 3 columns at the end of the day anyway: IMO it would only
> make codec code more complicated with no benefit. So that's why I was
> listing out other alternatives.

Got it. +1 from me as well. I think we're in agreement. Thanks for the
discussion!

Cheers,
-g

On Thu, May 26, 2022 at 9:04 AM Robert Muir <rcmuir@gmail.com> wrote:
>
> On Thu, May 26, 2022 at 11:49 AM Greg Miller <gsmiller@gmail.com> wrote:
> >
> > I agree that technically it's just as good. I also think it's less
> > clear for a user. The concept of "points" is something we've
> > established in Lucene, so I think it makes sense for users to think
> > about indexing points as a doc value as opposed to having to manage
> > multiple fields for all their dimensions in this sort of unsorted
> > field. But that's just my opinion as a user. But that's maybe a bit
> > philosophical at this point and I think we can "agree to disagree" for
> > now because...
>
> Users don't deal with low level docvalues codec APIs, so I see this
> "as a user" as irrelevant, sorry. Higher-level classes (e.g. Field
> class) could impl it this way as implementation detail.
>
> >
> > ... just to be clear, I'm _not_ suggesting we add a new doc value type
> > at this time. I'm not even necessarily advocating that we ever add it.
> > I think it's perfectly reasonable to define a new Field class that
> > builds on top of BDV (as Marc has done in his PR) that allows users to
> > add "point" fields to their documents that get indexed as doc values
> > (using BDV). This is very similar to LatLonDocValuesField,
> > LongRangeDocValuesField, etc. Is that an acceptable approach to you,
> > or are you advocating that we shouldn't do that and should instead
> > create these new "unsorted" numeric fields now? I'm even fine if we
> > put this in the sandbox module for now while we "kick the tires." In
> > fact, I think I'd advocate for that.
>
> +1 to build a field class in sandbox, using BDV behind the scenes. I
> don't want to add any new DV types, trust me. I am just especially
> opinionated against multidimensional stuff pushed down to docvalues
> level, when it makes no sense from a DV perspective (column stride
> fields). If you have 3 dimensions of numbers, at a low level it would
> just make 3 columns at the end of the day anyway: IMO it would only
> make codec code more complicated with no benefit. So that's why I was
> listing out other alternatives.
>
