Mailing List Archive

Multi-valued xxValue / xxValueSource implementations?
Hi folks-

Out of curiosity, is there a reason Lucene doesn't have
implementations for concepts like DoubleValues / DoubleValuesSource
that support multiple values per document? Or maybe something like
this does exist in Lucene that I'm not aware of? I can't believe this
hasn't been a topic of discussion at least once, but I couldn't turn
up a past Jira issue.

I ask because most of the faceting implementations in Lucene allow the
user to provide their own xxValuesSource to use instead of assuming
the data is in an indexed field, but there's an inherent limitation
here forcing documents to have a single value. The faceting
implementations have all been updated to operate correctly for
multi-valued documents when referencing an indexed field, but there's
a bit of a gap here if the user wants to supply their own source.

Many thanks!

Cheers,
-Greg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Multi-valued xxValue / xxValueSource implementations?
Hi Greg, I think the general issue is one of the API, the ValueSource
seems really geared at returning values from single-valued fields.

IMO, for the way the API is used (e.g. sorting), it makes sense to
define a selector that works in O(1) time per-document, and use these
existing valuesources:

https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedIntFieldSource.java
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedLongFieldSource.java
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedFloatFieldSource.java
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MultiValuedDoubleFieldSource.java

These require that you specify a "selector" as to who will be the
"stuckee" (designated value) for the doc:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SortedNumericSelector.java
I strongly recommend "min": the values for each doc are stored in sorted
order, so it can just read the first doc value (DV) for each doc.

For terms (strings), there is a similar thing:

https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/SortedSetFieldSource.java

And again, it has available selectors:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SortedSetSelector.java
I would still strongly recommend "min", to just read the first DV for each doc.
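
A rough plain-Java sketch of why "min" is the cheap selector. This is not
Lucene's actual implementation: it only assumes that a doc's values come
back from an iterator in ascending order. CountingIterator, selectMin and
selectMax are made-up names for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class SelectorCostSketch {
    /** Hypothetical stand-in for one doc's values, returned in ascending order. */
    static final class CountingIterator implements Iterator<Long> {
        private final List<Long> sortedValues;
        private int pos = 0;
        int reads = 0; // how many values have been consumed so far

        CountingIterator(List<Long> sortedValues) {
            this.sortedValues = sortedValues;
        }

        @Override
        public boolean hasNext() {
            return pos < sortedValues.size();
        }

        @Override
        public Long next() {
            if (!hasNext()) throw new NoSuchElementException();
            reads++;
            return sortedValues.get(pos++);
        }
    }

    /** MIN: the first value is the smallest, so a single read is enough. */
    static long selectMin(Iterator<Long> values) {
        return values.next();
    }

    /** MAX: walk to the last value, consuming every value on the way. */
    static long selectMax(Iterator<Long> values) {
        long last = values.next();
        while (values.hasNext()) {
            last = values.next();
        }
        return last;
    }

    public static void main(String[] args) {
        List<Long> docValues = List.of(3L, 7L, 42L);
        CountingIterator forMin = new CountingIterator(docValues);
        CountingIterator forMax = new CountingIterator(docValues);
        System.out.println("min=" + selectMin(forMin) + " reads=" + forMin.reads); // min=3 reads=1
        System.out.println("max=" + selectMax(forMax) + " reads=" + forMax.reads); // max=42 reads=3
    }
}
```

In this model the cost of "min" stays at one read per doc no matter how
many values the doc has, while "max" grows with the number of values.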

On Tue, Oct 26, 2021 at 7:49 PM Greg Miller <gsmiller@gmail.com> wrote:
> [...]

Re: Multi-valued xxValue / xxValueSource implementations?
A little history may help...

(this is based on my bad memory, so it could all be wrong, nobody get offended):

At the time, Lucene could only sort single-valued fields. But Solr and
Elasticsearch would happily sort on multi-valued docs in various hacky
ways, and this typically took large amounts of memory.
IMO, it was important to get docvalues working for most use-cases, but
this "sorting on multi-valued field" was a tricky one, because to me
it is MATHEMATICAL NONSENSE.

But it seemed nobody really cared about how the sorting worked (again,
it is MATHEMATICALLY INSANE anyway), only that users didn't have to
confess whether their fields were single-valued or multi-valued. So
they did stuff like substitute the min value for a forward sort, or the
max value for a reverse sort. These selectors allow you to implement
such a sort if you want. Hopefully MIN is the default and common case,
and you only need MAX in the rare case someone clicks an arrow to
reverse the sort, since MAX requires consuming all the ordinals for
each doc :)

On Tue, Oct 26, 2021 at 8:01 PM Robert Muir <rcmuir@gmail.com> wrote:
> [...]

Re: Multi-valued xxValue / xxValueSource implementations?
On Tue, Oct 26, 2021 at 8:01 PM Robert Muir <rcmuir@gmail.com> wrote:
>
> Hi Greg, I think the general issue is one of the API, the ValueSource
> seems really geared at returning values from single-valued fields.

I think this really is the core issue. This ValueSource thing was
created before the days of docvalues, and in a lot of cases it will do
inefficient things depending on how you use it.

I feel that things like the facet APIs should really try to move to
lower-level APIs (DoubleValuesSource, SortedSetDocValues, etc).

Turn the problem around from push to pull: now, if you want to
give "computed field" or similar inputs to faceting (e.g. some kind of
filtering-on-the-fly), you have the chance to implement it
efficiently.

The expressions module already switched away from this ValueSource to
DoubleValues/DoubleValuesSource, though I didn't follow the specific
reasons why. Maybe similar approaches apply to all the numerics.
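
To make the pull idea concrete, here is a minimal plain-Java sketch. It
does not use Lucene classes: PerDocDouble is a hypothetical interface
loosely modeled on the shape of DoubleValues (position on a doc, then read
its value), and fromMap, scaledAtLeast and sum are made-up helpers. The
consumer drives iteration and the "computed field" is just a wrapper.

```java
import java.util.Map;

public class PullValuesSketch {
    /** Hypothetical pull-style per-doc value, loosely modeled on DoubleValues. */
    interface PerDocDouble {
        boolean advanceExact(int doc); // position on doc; false if it has no value
        double doubleValue();          // only valid after advanceExact returned true
    }

    /** Values backed by an in-memory map, standing in for an indexed field. */
    static PerDocDouble fromMap(Map<Integer, Double> byDoc) {
        return new PerDocDouble() {
            private double current;

            @Override
            public boolean advanceExact(int doc) {
                Double v = byDoc.get(doc);
                if (v == null) return false;
                current = v;
                return true;
            }

            @Override
            public double doubleValue() {
                return current;
            }
        };
    }

    /** "Computed field": scales the input and hides docs whose scaled value is below min. */
    static PerDocDouble scaledAtLeast(PerDocDouble in, double factor, double min) {
        return new PerDocDouble() {
            private double current;

            @Override
            public boolean advanceExact(int doc) {
                if (!in.advanceExact(doc)) return false;
                current = in.doubleValue() * factor;
                return current >= min; // filtering on the fly
            }

            @Override
            public double doubleValue() {
                return current;
            }
        };
    }

    /** The consumer (think: faceting) walks the matching docs and pulls values for them. */
    static double sum(PerDocDouble values, int[] matchingDocs) {
        double total = 0;
        for (int doc : matchingDocs) {
            if (values.advanceExact(doc)) {
                total += values.doubleValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        PerDocDouble prices = fromMap(Map.of(0, 1.5, 1, 4.0, 3, 10.0));
        PerDocDouble computed = scaledAtLeast(prices, 2.0, 5.0);
        System.out.println(sum(computed, new int[] {0, 1, 2, 3})); // 28.0
    }
}
```

Only docs the consumer asks about are ever touched, and the wrapper does
its work per doc without materializing anything up front.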

As for the strings, personally, I'm not sure what a ValueSource API
that "filters/transforms" terms should look like. It seems slow no
matter how you do it. But maybe fresh ideas are needed.

Re: Multi-valued xxValue / xxValueSource implementations?
Thanks Robert for all your thoughts and context!

> I feel that things like the facet APIs should really try to move to lower-level APIs (DoubleValuesSource, SortedSetDocValues, etc).

Yeah I think this direction generally makes sense. All the cases I can
think of where a user might want to provide custom values (e.g.,
filtering, transforming, etc.) could be solved by allowing users to
pass their own xxDocValues instance into faceting implementations. For
example, if a user wanted to provide some filtering or transformation
on long values before counting them with LongValueFacetCounts, they
could do so by creating their own SortedNumericDocValues /
NumericDocValues implementations and passing them in if the faceting
implementations supported this.

The only possible gap I see here is that implementing xxDocValues
requires the ability to provide iteration over the documents
themselves, whereas xxValuesSource doesn't. So if there was some case
where a user wanted to provide multi-valued data but couldn't provide
document iteration, that might be an issue. It's a bit of a funny
limitation since faceting doesn't need the value source to lead
iteration, so I could see a multi-valued version of something like
LongValuesSource maybe being a better fit.
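
As a strawman for what I mean, here is a plain-Java sketch. MultiLongs is
a hypothetical interface (nothing here is an existing Lucene class): it
can be positioned on a doc but never leads iteration, and count is a
made-up stand-in for a faceting implementation consuming it. It assumes a
doc does not repeat the same value.

```java
import java.util.Map;
import java.util.TreeMap;

public class MultiLongValuesSketch {
    /** Hypothetical multi-valued per-doc longs; the caller always leads iteration. */
    interface MultiLongs {
        boolean advanceExact(int doc); // false if the doc has no values
        int valueCount();              // number of values for the current doc
        long nextValue();              // call at most valueCount() times per doc
    }

    /** Values backed by an in-memory map, standing in for any user-supplied source. */
    static MultiLongs fromArrays(Map<Integer, long[]> byDoc) {
        return new MultiLongs() {
            private long[] current = new long[0];
            private int pos;

            @Override
            public boolean advanceExact(int doc) {
                long[] v = byDoc.get(doc);
                current = (v == null) ? new long[0] : v;
                pos = 0;
                return current.length > 0;
            }

            @Override
            public int valueCount() {
                return current.length;
            }

            @Override
            public long nextValue() {
                return current[pos++];
            }
        };
    }

    /** Counts, per value, how many of the matching docs carry it. */
    static Map<Long, Integer> count(MultiLongs values, int[] matchingDocs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (int doc : matchingDocs) {
            if (!values.advanceExact(doc)) continue;
            for (int i = 0, n = values.valueCount(); i < n; i++) {
                counts.merge(values.nextValue(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        MultiLongs ratings = fromArrays(Map.of(0, new long[] {1, 5}, 2, new long[] {5}));
        System.out.println(count(ratings, new int[] {0, 1, 2})); // {1=1, 5=2}
    }
}
```

The source only has to answer "what are the values for this doc?", which
is all faceting needs, and it never has to enumerate documents itself.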

Cheers,
-Greg

On Tue, Oct 26, 2021 at 8:03 PM Robert Muir <rcmuir@gmail.com> wrote:
> [...]