Mailing List Archive

block min-max values for Sort Field with Top-N query..
Our Sort Fields utilize DocValues..

Let's say I collect the min-max ords of a Sort Field for each block of
documents (128, 256, etc.) at index time via a Codec & store them as part of
DocValues at the segment level..

At query time, could we take advantage of these stats when a Top-N query
sorted on that field is requested?

Typically, what I had in mind is a SortStats class with the following method:

int seek(int maxDocSeenTillNow, int minSortOrdSeenTillNow,
         boolean sortDesc) {
    // 1. Fetch the doc-ranges that have ords >= minSortOrdSeenTillNow
    // 2. Return the least doc-range >= maxDocSeenTillNow (if sortDesc=true)
    //    Return the least doc-range <= maxDocSeenTillNow (if sortDesc=false)
}

The Top-N Collector can keep track of the maxDocSeenTillNow &
minSortOrdSeenTillNow variables during query time & then call
SortStats.seek() to possibly skip blocks of documents that would
otherwise be needlessly offered to & popped out of the priority queue.
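
To make the idea concrete, here is a minimal sketch of such a skip in plain Java, with no Lucene dependency. Everything in it is an assumption for illustration: the SortStats class shape, its constructor, the fixed block size and the NO_MORE_DOCS sentinel are hypothetical, and the per-block min/max ord arrays are assumed to have been written at index time. It also collapses the two-step lookup into one forward scan over blocks, starting at the next doc the collector would visit, and compares against the ord currently at the bottom of the priority queue.

```java
/** Hypothetical per-segment block statistics for one sort field; not a Lucene API. */
final class SortStats {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int blockSize;      // docs per block, e.g. 128 or 256
    private final int maxDoc;         // number of docs in the segment
    private final int[] blockMinOrd;  // smallest ord seen in each block
    private final int[] blockMaxOrd;  // largest ord seen in each block

    SortStats(int blockSize, int maxDoc, int[] blockMinOrd, int[] blockMaxOrd) {
        if (blockSize <= 0 || blockMinOrd.length != blockMaxOrd.length) {
            throw new IllegalArgumentException("inconsistent block statistics");
        }
        this.blockSize = blockSize;
        this.maxDoc = maxDoc;
        this.blockMinOrd = blockMinOrd;
        this.blockMaxOrd = blockMaxOrd;
    }

    /**
     * Returns the first doc ID >= doc that lies in a block which may still
     * contain a competitive ord, or NO_MORE_DOCS if no such block remains.
     *
     * bottomOrd is the ord at the bottom of a full top-N queue. The comparison
     * is strict because, with in-order collection, a later doc that merely
     * ties with the bottom entry cannot displace it.
     */
    int seek(int doc, int bottomOrd, boolean sortDesc) {
        if (doc < 0) {
            throw new IllegalArgumentException("doc must be >= 0");
        }
        if (doc >= maxDoc) {
            return NO_MORE_DOCS;
        }
        for (int block = doc / blockSize; block < blockMinOrd.length; block++) {
            boolean competitive = sortDesc
                    ? blockMaxOrd[block] > bottomOrd   // descending: need a larger ord
                    : blockMinOrd[block] < bottomOrd;  // ascending: need a smaller ord
            if (competitive) {
                // Never move backwards: stay at doc if its own block qualifies.
                return Math.max(doc, block * blockSize);
            }
        }
        return NO_MORE_DOCS;
    }

    public static void main(String[] args) {
        // Three blocks of 4 docs each with made-up min/max ords.
        SortStats stats = new SortStats(4, 12, new int[] {1, 10, 3}, new int[] {5, 20, 9});
        System.out.println(stats.seek(0, 9, true));   // 4: block 0 (max ord 5) is skipped
        System.out.println(stats.seek(4, 4, false));  // 8: block 1 (min ord 10) is skipped
    }
}
```

The skip is only valid once the queue is full, and only when the query's own iterator can be advanced to the returned doc ID; how much it saves depends on how well the sort values cluster by doc ID.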

I understand this simplistic logic depends on sort-field data distribution
& won't work for multi-sort field queries or out-of-order scoring etc..

But, in general, would this be a good idea to explore, or is it something
that is best not attempted?

Any help is much appreciated

--
Ravi
Re: block min-max values for Sort Field with Top-N query..
Not sure what the problem is, but make sure you are aware of
https://lucene.apache.org/solr/guide/7_0/function-queries.html#childfield-field-function

On Tue, Jul 2, 2019 at 4:01 PM Ravikumar Govindarajan <
ravikumar.govindarajan@gmail.com> wrote:



--
Sincerely yours
Mikhail Khludnev
Re: block min-max values for Sort Field with Top-N query..
Hello,

This is the same principle that we apply for block-max WAND so
theoretically that would work, though in practice it might be a bit
hard to implement due to the fact that we don't have the APIs that you
will need.

I have considered the idea of adding information about blocks to doc
values a couple of times, but I think it'd be better to either:
- Directly index the field as a term frequency instead of doc
values, e.g. using FeatureField. One downside is that you can only
sort in one order efficiently.
- Or use LongDistanceFeatureQuery if your field is also indexed
with points, by passing the max value of your index as the "origin" if
you want to sort in decreasing order and the min value if you want to
sort in increasing order. This would be a bit less efficient than
FeatureField but would allow sorting in either ascending or descending
order.
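
The second option works because a distance-feature score shrinks as a value moves away from the origin, so ranking by score reproduces a sort by value. A small sketch of that reasoning follows, in plain Java rather than against Lucene itself. It assumes the score formula weight * pivotDistance / (pivotDistance + distance), which is how the Javadoc of LongPoint.newDistanceFeatureQuery describes it as far as I recall; the class and method names below are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustration only: mimics the assumed distance-feature score, not Lucene code. */
final class DistanceFeatureSortDemo {

    /** Assumed formula: weight * pivotDistance / (pivotDistance + |value - origin|). */
    static double score(double weight, long pivotDistance, long origin, long value) {
        long distance = Math.abs(value - origin); // overflow ignored: small demo values
        return weight * pivotDistance / (double) (pivotDistance + distance);
    }

    /** Returns the values ordered by decreasing score relative to the origin. */
    static Long[] rankByScore(long[] values, long origin, long pivotDistance) {
        Long[] ranked = Arrays.stream(values).boxed().toArray(Long[]::new);
        Arrays.sort(ranked,
                Comparator.comparingDouble((Long v) -> -score(1.0, pivotDistance, origin, v)));
        return ranked;
    }

    public static void main(String[] args) {
        long[] values = {30, 10, 50, 20};
        // origin = max value: closest to the max scores highest, i.e. descending sort
        System.out.println(Arrays.toString(rankByScore(values, 50, 10))); // [50, 30, 20, 10]
        // origin = min value: closest to the min scores highest, i.e. ascending sort
        System.out.println(Arrays.toString(rankByScore(values, 10, 10))); // [10, 20, 30, 50]
    }
}
```

The pivot distance only changes how fast the score decays, not the resulting order, as long as the origin sits at or beyond the extreme value of the field.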



On Tue, Jul 2, 2019 at 3:01 PM Ravikumar Govindarajan
<ravikumar.govindarajan@gmail.com> wrote:



--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: block min-max values for Sort Field with Top-N query..
Thanks Mikhail & Adrien for the help

> This is the same principle that we apply for block-max WAND so
> theoretically that would work, though in practice it might be a bit
> hard to implement due to the fact that we don't have the APIs that you
> will need.


Ah, I did not know block-max WAND is now in Lucene! So what I am proposing
looks identical to block-max WAND..

The heavy lifting is already done in the Lucene codebase. I think it should
be straightforward for us to wrap DocValues in a custom Codec to track block
min-max ords. We shall give this a shot anyway & see how it goes.
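
As a rough illustration of the index-time half, the per-block statistics could be gathered in one pass over the ords of a segment. The sketch below is plain Java and deliberately leaves out the codec wiring; the BlockOrdStats class and its build method are hypothetical names, and it assumes every document has exactly one ord (a dense, single-valued field).

```java
import java.util.Arrays;

/** Hypothetical holder for per-block min/max ords of one segment; not a Lucene API. */
final class BlockOrdStats {
    final int[] minOrd;
    final int[] maxOrd;

    private BlockOrdStats(int[] minOrd, int[] maxOrd) {
        this.minOrd = minOrd;
        this.maxOrd = maxOrd;
    }

    /**
     * Computes min and max ord per block of blockSize consecutive doc IDs.
     * ordsByDoc[doc] is the sort-field ord of that doc; the last block may be
     * shorter than blockSize.
     */
    static BlockOrdStats build(int[] ordsByDoc, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        int numBlocks = (ordsByDoc.length + blockSize - 1) / blockSize; // ceiling division
        int[] min = new int[numBlocks];
        int[] max = new int[numBlocks];
        Arrays.fill(min, Integer.MAX_VALUE);
        Arrays.fill(max, Integer.MIN_VALUE);
        for (int doc = 0; doc < ordsByDoc.length; doc++) {
            int block = doc / blockSize;
            min[block] = Math.min(min[block], ordsByDoc[doc]);
            max[block] = Math.max(max[block], ordsByDoc[doc]);
        }
        return new BlockOrdStats(min, max);
    }

    public static void main(String[] args) {
        int[] ords = {3, 1, 5, 2, 20, 10, 15, 12, 9, 3};
        BlockOrdStats stats = build(ords, 4);
        System.out.println(Arrays.toString(stats.minOrd)); // [1, 10, 3]
        System.out.println(Arrays.toString(stats.maxOrd)); // [5, 20, 9]
    }
}
```

Since ords are per segment, these statistics would have to be recomputed whenever segments are merged, which is one reason to keep them next to the DocValues data written by the codec.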

> Directly index the field as a term frequency instead of doc
> values, e.g. using FeatureField. One downside is that you can only
> sort in one order efficiently.

Thanks for the suggestion. Sure, we will try FeatureField too!

--
Ravi

On Tue, Jul 2, 2019 at 6:52 PM Adrien Grand <jpountz@gmail.com> wrote:
