Mailing List Archive

Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl?
Hey folks-

I've been chatting with Marc D'Mello a bit about the SSDV faceting
he's working on (LUCENE-10250) (disclaimer: we both work on Amazon's
Product Search engine). We're trying to figure out where
taxonomy-based faceting has a performance advantage over SSDV, and it
occurred to me that the way the two approaches resolve the paths for
given ordinals is a bit different. TaxonomyReader was recently updated
to support bulk ordinal resolution (LUCENE-9476), but SSDV faceting is
stuck looking up paths one-at-a-time via SSDV#lookupOrd(ord). This
results in a separate TermsEnum#seekExact() call down in
Lucene90DocValuesProducer for each ordinal being returned.

Having no knowledge about the actual data representation behind the
TermsDict in an SSDV field, I'm wondering if someone here can provide
a high-level sense of whether-or-not there might be an advantage to
looking up ordinals in bulk. I'm going to dig into the code anyway
(curious!), but thought I'd raise the idea/question here as well
regarding whether-or-not a bulk lookup might be advantageous in
general for SSDV fields. Any thoughts?

Cheers,
-Greg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl? [ In reply to ]
On Thu, Dec 16, 2021 at 3:53 PM Greg Miller <gsmiller@gmail.com> wrote:
>

> TaxonomyReader was recently updated
> to support bulk ordinal resolution (LUCENE-9476), but SSDV faceting is
> stuck looking up paths one-at-a-time via SSDV#lookupOrd(ord). This
> results in a separate TermsEnum#seekExact() call down in
> Lucene90DocValuesProducer for each ordinal being returned.
>

I'm confused, where do we do gazillions of lookupOrd(), we should not
be doing that. The ordinals should be used for all the heavy-duty
work, and at the very end, only the top-10 or whatever resolved back
to strings with lookupOrd. Think of it kinda like the stored fields :)

> Having no knowledge about the actual data representation behind the
> TermsDict in an SSDV field, I'm wondering if someone here can provide
> a high-level sense of whether-or-not there might be an advantage to
> looking up ordinals in bulk. I'm going to dig into the code anyway
> (curious!), but thought I'd raise the idea/question here as well
> regarding whether-or-not a bulk lookup might be advantageous in
> general for SSDV fields. Any thoughts?

I don't think we should provide such an API, because the operation is
slow and should not be done in "bulk" anyway. Number of lookups should
be low (e.g. 10, 50, whatever the user's top-N is). If you want to
optimize it, sort them in ascending order and look that up first, but
honestly in most cases, that probably isn't even worth it.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl? [ In reply to ]
On Thu, Dec 16, 2021 at 1:31 PM Robert Muir <rcmuir@gmail.com> wrote:
>
> On Thu, Dec 16, 2021 at 3:53 PM Greg Miller <gsmiller@gmail.com> wrote:
> >
>
> > TaxonomyReader was recently updated
> > to support bulk ordinal resolution (LUCENE-9476), but SSDV faceting is
> > stuck looking up paths one-at-a-time via SSDV#lookupOrd(ord). This
> > results in a separate TermsEnum#seekExact() call down in
> > Lucene90DocValuesProducer for each ordinal being returned.
> >
>
> I'm confused, where do we do gazillions of lookupOrd(), we should not
> be doing that. The ordinals should be used for all the heavy-duty
> work, and at the very end, only the top-10 or whatever resolved back
> to strings with lookupOrd. Think of it kinda like the stored fields :)

This is right, but we still need to do the lookup for each value being
returned (which is bounded by the top-n param supplied by the user).
In getAllDims, we'll do "n" lookups for every dimension indexed. So
while we're working in "ordinal space" for doing all the counting and
such, there could still be a somewhat sizable number of ordinals that
need to be looked up after counting. This is where taxo-faceting leans
on bulk lookups.

We also call lookupOrd for _every_ ordinal in the given field when
building the state (see the ctor logic in
DefaultSortedSetDocValuesReaderState). I'm not as concerned about this
since state building only needs to happen when the index changes.

>
> > Having no knowledge about the actual data representation behind the
> > TermsDict in an SSDV field, I'm wondering if someone here can provide
> > a high-level sense of whether-or-not there might be an advantage to
> > looking up ordinals in bulk. I'm going to dig into the code anyway
> > (curious!), but thought I'd raise the idea/question here as well
> > regarding whether-or-not a bulk lookup might be advantageous in
> > general for SSDV fields. Any thoughts?
>
> I don't think we should provide such an API, because the operation is
> slow and should not be done in "bulk" anyway. Number of lookups should
> be low (e.g. 10, 50, whatever the user's top-N is). If you want to
> optimize it, sort them in ascending order and look that up first, but
> honestly in most cases, that probably isn't even worth it.

That's fair. I can see the argument for not wanting to encourage
unnecessary lookups with a "bulk" operation. Thanks for the feedback.
I'll think about this a little more when I have some time to dig into
the code, but what you're saying sounds reasonable.

>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl? [ In reply to ]
On Thu, Dec 16, 2021 at 5:05 PM Greg Miller <gsmiller@gmail.com> wrote:
>
> On Thu, Dec 16, 2021 at 1:31 PM Robert Muir <rcmuir@gmail.com> wrote:
> >
> > On Thu, Dec 16, 2021 at 3:53 PM Greg Miller <gsmiller@gmail.com> wrote:
> > >
> >
> > > TaxonomyReader was recently updated
> > > to support bulk ordinal resolution (LUCENE-9476), but SSDV faceting is
> > > stuck looking up paths one-at-a-time via SSDV#lookupOrd(ord). This
> > > results in a separate TermsEnum#seekExact() call down in
> > > Lucene90DocValuesProducer for each ordinal being returned.
> > >
> >
> > I'm confused, where do we do gazillions of lookupOrd(), we should not
> > be doing that. The ordinals should be used for all the heavy-duty
> > work, and at the very end, only the top-10 or whatever resolved back
> > to strings with lookupOrd. Think of it kinda like the stored fields :)
>
> This is right, but we still need to do the lookup for each value being
> returned (which is bounded by the top-n param supplied by the user).
> In getAllDims, we'll do "n" lookups for every dimension indexed. So
> while we're working in "ordinal space" for doing all the counting and
> such, there could still be a somewhat sizable number of ordinals that
> need to be looked up after counting. This is where taxo-faceting leans
> on bulk lookups.

OK I need to understand this better, because I don't see why it is
necessary to do it this way. It definitely is very different from the
way solr wiki page documents hierarchical faceting. Maybe we should
adopt their approach?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl? [ In reply to ]
On Thu, Dec 16, 2021 at 2:29 PM Robert Muir <rcmuir@gmail.com> wrote:
>
> On Thu, Dec 16, 2021 at 5:05 PM Greg Miller <gsmiller@gmail.com> wrote:
> >
> > On Thu, Dec 16, 2021 at 1:31 PM Robert Muir <rcmuir@gmail.com> wrote:
> > >
> > > On Thu, Dec 16, 2021 at 3:53 PM Greg Miller <gsmiller@gmail.com> wrote:
> > > >
> > >
> > > > TaxonomyReader was recently updated
> > > > to support bulk ordinal resolution (LUCENE-9476), but SSDV faceting is
> > > > stuck looking up paths one-at-a-time via SSDV#lookupOrd(ord). This
> > > > results in a separate TermsEnum#seekExact() call down in
> > > > Lucene90DocValuesProducer for each ordinal being returned.
> > > >
> > >
> > > I'm confused, where do we do gazillions of lookupOrd(), we should not
> > > be doing that. The ordinals should be used for all the heavy-duty
> > > work, and at the very end, only the top-10 or whatever resolved back
> > > to strings with lookupOrd. Think of it kinda like the stored fields :)
> >
> > This is right, but we still need to do the lookup for each value being
> > returned (which is bounded by the top-n param supplied by the user).
> > In getAllDims, we'll do "n" lookups for every dimension indexed. So
> > while we're working in "ordinal space" for doing all the counting and
> > such, there could still be a somewhat sizable number of ordinals that
> > need to be looked up after counting. This is where taxo-faceting leans
> > on bulk lookups.
>
> OK I need to understand this better, because I don't see why it is
> necessary to do it this way. It definitely is very different from the
> way solr wiki page documents hierarchical faceting. Maybe we should
> adopt their approach?

This is separate from adding hierarchical support. I'm probably not
communicating the current state well, but here's where SSDV faceting
does ordinal lookups:
https://github.com/apache/lucene/blob/c64e5fe84c4990968844193e3a62f4ebbba638ea/lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java#L148

So this is done for every returned value, which as you describe,
scales with the requested top-n. For getAllDims, this logic is
executed for every dimension.

I don't think these lookups are avoidable since we provide the path
for each returned value, and in order to get the path, we need to
dereference the ordinal.

>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl? [ In reply to ]
On Thu, Dec 16, 2021 at 5:57 PM Greg Miller <gsmiller@gmail.com> wrote:
>
> This is separate from adding hierarchical support. I'm probably not
> communicating the current state well, but here's where SSDV faceting
> does ordinal lookups:
> https://github.com/apache/lucene/blob/c64e5fe84c4990968844193e3a62f4ebbba638ea/lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java#L148
>
> So this is done for every returned value, which as you describe,
> scales with the requested top-n. For getAllDims, this logic is
> executed for every dimension.
>
> I don't think these lookups are avoidable since we provide the path
> for each returned value, and in order to get the path, we need to
> dereference the ordinal.
>

OK I get it. I think the strangeness (compared to e.g. solr faceting)
is that we're mixing ordinals from different fields ("dims") all into
one DV field? And then we have a trappy method to do top-N for all
possible dims in this single packed field (what if there are
thousands???).

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl? [ In reply to ]
On Thu, Dec 16, 2021 at 4:56 PM Robert Muir <rcmuir@gmail.com> wrote:
>
> On Thu, Dec 16, 2021 at 5:57 PM Greg Miller <gsmiller@gmail.com> wrote:
> >
> > This is separate from adding hierarchical support. I'm probably not
> > communicating the current state well, but here's where SSDV faceting
> > does ordinal lookups:
> > https://github.com/apache/lucene/blob/c64e5fe84c4990968844193e3a62f4ebbba638ea/lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java#L148
> >
> > So this is done for every returned value, which as you describe,
> > scales with the requested top-n. For getAllDims, this logic is
> > executed for every dimension.
> >
> > I don't think these lookups are avoidable since we provide the path
> > for each returned value, and in order to get the path, we need to
> > dereference the ordinal.
> >
>
> OK I get it. I think the strangeness (compared to e.g. solr faceting)
> is that we're mixing ordinals from different fields ("dims") all into
> one DV field? And then we have a trappy method to do top-N for all
> possible dims in this single packed field (what if there are
> thousands???).

Right. The "get all dims" functionality does have a bit of a "trappy"
feel to it for this reason. I think there are situations where "dim
mixing" can be beneficial; if you actually do need facet counts for
most (or all) of your dims, I can see the benefit of iterating the
FacetsCollector once, counting everything in the SSDV field in the
same pass, then getting the results you need. But this is very
suboptimal if you have a large number of dims and only need faceting
on a small number of them (burn a bunch of up-front cost counting dims
you don't care about). I tried to address this by providing
StringValueFacetCounts (LUCENE-9950), which essentially chucks the
concept of "dim" altogether and assumes the field itself is the dim
(sounds like what solr does, but I need to get more familiar with that
impl). When I introduced StringValueFacetCounts, I was hesitant to
suggest deprecating what SSDV faceting does since I think there are
valid applications for wanting to pack many dims into a single field.
For what it's worth, taxonomy-based faceting operates in the same way,
defaulting to packing all the dims into one doc value field.

Anyway, not really sure where I'm going with all this except to say +1
to getAllDims being potentially trappy. I can see users thinking,
"well, I need to grab counts for a few different dims so I'll just
call getAllDims then pull out what I want instead of calling
getTopChildren for each dim." Hopefully they're not doing this, but it
would be an easy trap to fall into.

I suppose the last thing I'd say is that there are valid use-cases for
wanting the "top" dims along with their "top" children, and getAllDims
provides a reasonable way to do this. For example, in Amazon's product
search, we have a large number of different dims but only want to show
a small sub-set to customers on a search page. One way to go about
this would be to determine the "top" dims for the match set along with
the "top n" values under each; getAllDims is helpful for this but has
a bit of an unpleasant side-effect that it unnecessarily resolves the
paths for all children for all dims. As I think about this, I wonder
if a getTopDims method would be more useful that lets the user specify
the number of dims they want back along with the number of children
for each? I'll open a Jira for that.

>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl? [ In reply to ]
On Fri, Dec 17, 2021 at 10:02 AM Greg Miller <gsmiller@gmail.com> wrote:
>
> I suppose the last thing I'd say is that there are valid use-cases for
> wanting the "top" dims along with their "top" children, and getAllDims
> provides a reasonable way to do this. For example, in Amazon's product
> search, we have a large number of different dims but only want to show
> a small sub-set to customers on a search page. One way to go about
> this would be to determine the "top" dims for the match set along with
> the "top n" values under each; getAllDims is helpful for this but has
> a bit of an unpleasant side-effect that it unnecessarily resolves the
> paths for all children for all dims. As I think about this, I wonder
> if a getTopDims method would be more useful that lets the user specify
> the number of dims they want back along with the number of children
> for each? I'll open a Jira for that.

getTopDims() seems much more reasonable than getAllDims() for this use-case!

But still, I feel like facets is "storing" stuff in a
lucene-3.x-style-term-dictionary here. Background: before Lucene 4.0,
all the terms across all the indexed fields were stored in a single
massive dictionary, but each one coded, very similar to what facets is
doing.

We found it was better to keep fields separate.

I really think it might be the same for facets. If i have a field
"color", index it with DV so that I can both sort and facet on it. If
i have another field "size", do the same thing. If you want to facet
on both fields, facet on both fields. And you get two single-valued
fields instead of one big multi-valued field... so I'm not sure I am
convinced that "dim mixing" is typically a good thing.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl? [ In reply to ]
> And you get two single-valued fields instead of one big multi-valued field... so I'm not sure I am convinced that "dim mixing" is typically a good thing.

Mixing enables the user to model multiple (of their) fields within a
single Lucene field. You may have a very lightweight and
loosely-managed "schema" (think lots of data suppliers all wanting to
add their field names, but with no access to update the index schema)
that you want to model in a carefully managed set of (Lucene) index
fields.

On Fri, Dec 17, 2021 at 10:09 AM Robert Muir <rcmuir@gmail.com> wrote:
>
> On Fri, Dec 17, 2021 at 10:02 AM Greg Miller <gsmiller@gmail.com> wrote:
> >
> > I suppose the last thing I'd say is that there are valid use-cases for
> > wanting the "top" dims along with their "top" children, and getAllDims
> > provides a reasonable way to do this. For example, in Amazon's product
> > search, we have a large number of different dims but only want to show
> > a small sub-set to customers on a search page. One way to go about
> > this would be to determine the "top" dims for the match set along with
> > the "top n" values under each; getAllDims is helpful for this but has
> > a bit of an unpleasant side-effect that it unnecessarily resolves the
> > paths for all children for all dims. As I think about this, I wonder
> > if a getTopDims method would be more useful that lets the user specify
> > the number of dims they want back along with the number of children
> > for each? I'll open a Jira for that.
>
> getTopDims() seems much more reasonable than getAllDims() for this use-case!
>
> But still, I feel like facets is "storing" stuff in a
> lucene-3.x-style-term-dictionary here. Background: before Lucene 4.0,
> all the terms across all the indexed fields were stored in a single
> massive dictionary, but each one coded, very similar to what facets is
> doing.
>
> We found it was better to keep fields separate.
>
> I really think it might be the same for facets. If i have a field
> "color", index it with DV so that I can both sort and facet on it. If
> i have another field "size", do the same thing. If you want to facet
> on both fields, facet on both fields. And you get two single-valued
> fields instead of one big multi-valued field... so I'm not sure I am
> convinced that "dim mixing" is typically a good thing.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Any potential benefits to a SSDV#bulkLookupOrd(long ord) impl? [ In reply to ]
On Fri, Dec 17, 2021 at 10:19 AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> > And you get two single-valued fields instead of one big multi-valued field... so I'm not sure I am convinced that "dim mixing" is typically a good thing.
>
> Mixing enables the user to model multiple (of their) fields within a
> single Lucene field. You may have a very lightweight and
> loosely-managed "schema" (think lots of data suppliers all wanting to
> add their field names, but with no access to update the index schema)
> that you want to model in a carefully managed set of (Lucene) index
> fields.
>

I do get that you have a use-case for it. I just personally think it
is atypical.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org