Mailing List Archive: Getting all values for a specific dimension for SortedSetDocValues per document

Getting all values for a specific dimension for SortedSetDocValues per document

Jun 30, 2022, 11:52 AM

Post #1 of 5 (387 views)

Hi!

I'm looking for a solution for the following problem:

I would like to get all the values for a specific dimension for
SortedSetDocValues per document. I've basically copied
SortedSetDocValuesFacetCounts, but instead of just counting, I build a
map from doc to values. The problem here is, that I have to iterate
through all ords and map them to global ords to check if they belong to
the desired dimension. And this is very inefficient. Is there a better
way, to iterate through documents and get all the values for a specific
dimension?

Here is a simplified code of what I'm doing now:

SortedSetDocValuesReaderState state;

LongValues segOrdMap = ordinalMap.getGlobalOrds(segOrd);
SortedSetDocValues it = DocValues.getSortedSet(reader, field);

OrdRange dimOrdRange = state.getOrdRange(dim);

for (int doc = it.nextDoc();...)
for (long term = it.nextOrd(); ...)
long globalTerm = segOrdMap.get(term) : term;
if (globalTerm >= dimOrdRange.start &&
globalTerm <= dimOrdRange end) {
// add doc/globalTerm
}
}
}

Am I on the completely wrong path here?

Thanks in advance and regards
harry

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Getting all values for a specific dimension for SortedSetDocValues per document [ In reply to ]

gsmiller at gmail

Jun 30, 2022, 3:46 PM

Post #2 of 5 (387 views)

Permalink

Hi Harry-

Have you considered taxonomy faceting for your use-case? Because the
taxonomy structure is maintained in a separate index, it's
(relatively) trivial to iterate all direct child ordinals of a given
dimension. The cost of mapping to a global ordinal space is done when
the index is merged.

Separately, I'd be curious about where you're running into performance
issues within the context of your system. Is the cost you're concerned
with building up the ordinal map? That's certainly expensive, but it's
a one-time cost (until you refresh your index). Or are you concerned
with the actual map lookup within your tight loop? If the latter, you
could consider doing more work at the slice-level by separately
determining the child ords for each dim ord within the context of each
segment (there's no off-the-shelf code for this that I'm aware of, so
you'd have to roll your own).

Cheers,
-Greg

On Thu, Jun 30, 2022 at 11:52 AM Harald Braumann <braumann@m2n.at> wrote:
>
> Hi!
>
> I'm looking for a solution for the following problem:
>
> I would like to get all the values for a specific dimension for
> SortedSetDocValues per document. I've basically copied
> SortedSetDocValuesFacetCounts, but instead of just counting, I build a
> map from doc to values. The problem here is, that I have to iterate
> through all ords and map them to global ords to check if they belong to
> the desired dimension. And this is very inefficient. Is there a better
> way, to iterate through documents and get all the values for a specific
> dimension?
>
> Here is a simplified code of what I'm doing now:
>
> SortedSetDocValuesReaderState state;
>
> LongValues segOrdMap = ordinalMap.getGlobalOrds(segOrd);
> SortedSetDocValues it = DocValues.getSortedSet(reader, field);
>
> OrdRange dimOrdRange = state.getOrdRange(dim);
>
> for (int doc = it.nextDoc();...)
> for (long term = it.nextOrd(); ...)
> long globalTerm = segOrdMap.get(term) : term;
> if (globalTerm >= dimOrdRange.start &&
> globalTerm <= dimOrdRange end) {
> // add doc/globalTerm
> }
> }
> }
>
> Am I on the completely wrong path here?
>
> Thanks in advance and regards
> harry
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Getting all values for a specific dimension for SortedSetDocValues per document [ In reply to ]

braumann at m2n

Jul 1, 2022, 2:35 AM

Post #3 of 5 (387 views)

Permalink

Hi!

On 01.07.22 00:46, Greg Miller wrote:
> Have you considered taxonomy faceting for your use-case? Because the
> taxonomy structure is maintained in a separate index, it's
> (relatively) trivial to iterate all direct child ordinals of a given
> dimension. The cost of mapping to a global ordinal space is done when
> the index is merged.

Thanks for the tip. I will certainly look into it.

> Separately, I'd be curious about where you're running into performance
> issues within the context of your system. Is the cost you're concerned
> with building up the ordinal map? That's certainly expensive, but it's
> a one-time cost (until you refresh your index).

That's not the problem. The index changes rarely during operation, anyways.

> Or are you concerned
> with the actual map lookup within your tight loop?

Yes. The index I tested has about 3M documents and overall about 180M
doc value ords. Just iterating, without retrieving the actual values or
building the result map takes close to 3s. All the time seems to be
spent in SortedSetDocValues.nextOrd and LongValues.get.

> If the latter, you
> could consider doing more work at the slice-level by separately
> determining the child ords for each dim ord within the context of each
> segment (there's no off-the-shelf code for this that I'm aware of, so
> you'd have to roll your own).

I was thinking about this as well. So lets say, I have the ord ranges
for dimensions per segment. I guess, I could easily test this by making
sure, there is only one segment, so I can use the global ord ranges I
already got. What would be the best way to jump to those ords for each
document? If I use SortedSetDocValues, I'd still had to iterator through
all ords per document, right? Maybe that's not really the problem. I've
tried to time this, but both the profiler and hand crafted timing code
massively skew the results, so I'm not sure, I trust those measurements.

Cheers
harry

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Getting all values for a specific dimension for SortedSetDocValues per document [ In reply to ]

gsmiller at gmail

Jul 1, 2022, 9:58 AM

Post #4 of 5 (387 views)

Permalink

To address the last topic (building up ordinal ranges per-segment),
what I'm thinking is that you'd iterate all unique ordinals in the
SSDV field and "memorize" the ordinal range for each dimension
up-front, but on a per-segment basis. This would be very similar to
what DefaultSortedSetDocValuesReaderState#createOneFlatFacetDimState
is doing, but you'd do it per-segment. That per-segment state
information would only need to be done once, since segments are
immutable. Then, when you're iterating all your hits within each
segment, you can determine whether-or-not each stored doc value for
the doc is within the dim range using your segment-level ordinal range
information, avoiding the need to do the global mapping. You can
short-circuit the document-level ordinal iteration once an ordinal
goes beyond the range since the ordinals are stored in sorted order,
but in general, you're still doing a linear operation per document. No
way to avoid that. But you can also resolve the BytesRef within the
context of each segment, completely avoiding the need for a global
mapping.

Not sure this will result in any major performance improvements for
you, but it does let you avoid building a global ordinal map and doing
map lookups within the tight loop.

Cheers,
-Greg

On Fri, Jul 1, 2022 at 2:35 AM Harald Braumann <braumann@m2n.at> wrote:
>
> Hi!
>
> On 01.07.22 00:46, Greg Miller wrote:
> > Have you considered taxonomy faceting for your use-case? Because the
> > taxonomy structure is maintained in a separate index, it's
> > (relatively) trivial to iterate all direct child ordinals of a given
> > dimension. The cost of mapping to a global ordinal space is done when
> > the index is merged.
>
> Thanks for the tip. I will certainly look into it.
>
> > Separately, I'd be curious about where you're running into performance
> > issues within the context of your system. Is the cost you're concerned
> > with building up the ordinal map? That's certainly expensive, but it's
> > a one-time cost (until you refresh your index).
>
> That's not the problem. The index changes rarely during operation, anyways.
>
> > Or are you concerned
> > with the actual map lookup within your tight loop?
>
> Yes. The index I tested has about 3M documents and overall about 180M
> doc value ords. Just iterating, without retrieving the actual values or
> building the result map takes close to 3s. All the time seems to be
> spent in SortedSetDocValues.nextOrd and LongValues.get.
>
> > If the latter, you
> > could consider doing more work at the slice-level by separately
> > determining the child ords for each dim ord within the context of each
> > segment (there's no off-the-shelf code for this that I'm aware of, so
> > you'd have to roll your own).
>
> I was thinking about this as well. So lets say, I have the ord ranges
> for dimensions per segment. I guess, I could easily test this by making
> sure, there is only one segment, so I can use the global ord ranges I
> already got. What would be the best way to jump to those ords for each
> document? If I use SortedSetDocValues, I'd still had to iterator through
> all ords per document, right? Maybe that's not really the problem. I've
> tried to time this, but both the profiler and hand crafted timing code
> massively skew the results, so I'm not sure, I trust those measurements.
>
> Cheers
> harry
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Getting all values for a specific dimension for SortedSetDocValues per document [ In reply to ]

braumann at m2n

Jul 1, 2022, 10:07 AM

Post #5 of 5 (387 views)

Permalink

Hi!

Thanks a lot for your help. I will try both of you suggestions (taxo
index and per-segment ord ranges).

Thanks for clarifying, that I have to iterate the ords. I wasn't sure,
if I didn't just overlook something obvious. Like some way to do an
advanceExact on ords.

Regards
harry

On 01.07.22 18:58, Greg Miller wrote:
> To address the last topic (building up ordinal ranges per-segment),
> what I'm thinking is that you'd iterate all unique ordinals in the
> SSDV field and "memorize" the ordinal range for each dimension
> up-front, but on a per-segment basis. This would be very similar to
> what DefaultSortedSetDocValuesReaderState#createOneFlatFacetDimState
> is doing, but you'd do it per-segment. That per-segment state
> information would only need to be done once, since segments are
> immutable. Then, when you're iterating all your hits within each
> segment, you can determine whether-or-not each stored doc value for
> the doc is within the dim range using your segment-level ordinal range
> information, avoiding the need to do the global mapping. You can
> short-circuit the document-level ordinal iteration once an ordinal
> goes beyond the range since the ordinals are stored in sorted order,
> but in general, you're still doing a linear operation per document. No
> way to avoid that. But you can also resolve the BytesRef within the
> context of each segment, completely avoiding the need for a global
> mapping.
>
> Not sure this will result in any major performance improvements for
> you, but it does let you avoid building a global ordinal map and doing
> map lookups within the tight loop.
>
> Cheers,
> -Greg
>
> On Fri, Jul 1, 2022 at 2:35 AM Harald Braumann <braumann@m2n.at> wrote:
>>
>> Hi!
>>
>> On 01.07.22 00:46, Greg Miller wrote:
>>> Have you considered taxonomy faceting for your use-case? Because the
>>> taxonomy structure is maintained in a separate index, it's
>>> (relatively) trivial to iterate all direct child ordinals of a given
>>> dimension. The cost of mapping to a global ordinal space is done when
>>> the index is merged.
>>
>> Thanks for the tip. I will certainly look into it.
>>
>>> Separately, I'd be curious about where you're running into performance
>>> issues within the context of your system. Is the cost you're concerned
>>> with building up the ordinal map? That's certainly expensive, but it's
>>> a one-time cost (until you refresh your index).
>>
>> That's not the problem. The index changes rarely during operation, anyways.
>>
>>> Or are you concerned
>>> with the actual map lookup within your tight loop?
>>
>> Yes. The index I tested has about 3M documents and overall about 180M
>> doc value ords. Just iterating, without retrieving the actual values or
>> building the result map takes close to 3s. All the time seems to be
>> spent in SortedSetDocValues.nextOrd and LongValues.get.
>>
>>> If the latter, you
>>> could consider doing more work at the slice-level by separately
>>> determining the child ords for each dim ord within the context of each
>>> segment (there's no off-the-shelf code for this that I'm aware of, so
>>> you'd have to roll your own).
>>
>> I was thinking about this as well. So lets say, I have the ord ranges
>> for dimensions per segment. I guess, I could easily test this by making
>> sure, there is only one segment, so I can use the global ord ranges I
>> already got. What would be the best way to jump to those ords for each
>> document? If I use SortedSetDocValues, I'd still had to iterator through
>> all ords per document, right? Maybe that's not really the problem. I've
>> tried to time this, but both the profiler and hand crafted timing code
>> massively skew the results, so I'm not sure, I trust those measurements.
>>
>> Cheers
>> harry
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org