Our sort fields use DocValues.
Let's say I collect the min and max ords of a sort field for each block of
documents (128, 256, etc.) at index time via a custom Codec and store them
as part of the DocValues at the segment level.
At query time, could we take advantage of these stats when a top-N query
sorted on that field is requested?
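As a rough illustration of the index-time half, here is a minimal sketch in plain Java. It does not use any Lucene codec APIs: the BlockOrdStats class and its build method are hypothetical names, and it assumes every document has exactly one ord, available in an array indexed by doc ID.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-block stats: min and max sort ord for docs firstDoc..lastDoc. */
final class BlockOrdStats {
    final int firstDoc;
    final int lastDoc;
    final int minOrd;
    final int maxOrd;

    BlockOrdStats(int firstDoc, int lastDoc, int minOrd, int maxOrd) {
        this.firstDoc = firstDoc;
        this.lastDoc = lastDoc;
        this.minOrd = minOrd;
        this.maxOrd = maxOrd;
    }

    /** Builds one stats entry per block of blockSize consecutive doc IDs. */
    static List<BlockOrdStats> build(int[] ordByDoc, int blockSize) {
        List<BlockOrdStats> blocks = new ArrayList<>();
        for (int start = 0; start < ordByDoc.length; start += blockSize) {
            int end = Math.min(start + blockSize, ordByDoc.length); // exclusive
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int doc = start; doc < end; doc++) {
                min = Math.min(min, ordByDoc[doc]);
                max = Math.max(max, ordByDoc[doc]);
            }
            blocks.add(new BlockOrdStats(start, end - 1, min, max));
        }
        return blocks;
    }
}
```

In a real codec these entries would presumably be written next to the DocValues data for the segment; how exactly is left open here.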
What I had in mind is a SortStats class with the following method:

  int seek(int maxDocSeenTillNow, int minSortOrdSeenTillNow,
           boolean sortDesc) {
    // 1. Fetch the doc-ranges that have ords >= minSortOrdSeenTillNow.
    // 2. If sortDesc == true, return the least doc-range
    //    >= maxDocSeenTillNow.
    //    If sortDesc == false, return the least doc-range
    //    <= maxDocSeenTillNow.
  }

The top-N collector can keep track of the maxDocSeenTillNow and
minSortOrdSeenTillNow variables at query time and then call
SortStats.seek() to possibly skip blocks of documents that would
otherwise be needlessly offered to, and popped from, the priority queue.
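To make the query-time half concrete, here is a minimal sketch in plain Java, covering only the descending case. It is not Lucene code: SortStatsSketch, seek and collectTopN are hypothetical names, the query is assumed to match every document, only the per-block max ord is kept, and seek takes the next doc to visit rather than the last doc seen.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Hypothetical block-skipping helper for a descending top-N sort on ords. */
final class SortStatsSketch {
    private final int blockSize;
    private final int[] blockMaxOrd; // max ord per block of blockSize docs

    SortStatsSketch(int[] ordByDoc, int blockSize) {
        this.blockSize = blockSize;
        int blocks = (ordByDoc.length + blockSize - 1) / blockSize;
        blockMaxOrd = new int[blocks];
        Arrays.fill(blockMaxOrd, Integer.MIN_VALUE);
        for (int doc = 0; doc < ordByDoc.length; doc++) {
            int b = doc / blockSize;
            blockMaxOrd[b] = Math.max(blockMaxOrd[b], ordByDoc[doc]);
        }
    }

    /**
     * Returns the first doc >= fromDoc that lies in a block whose max ord is
     * greater than minSortOrdSeenTillNow, or numDocs if no such block exists.
     */
    int seek(int fromDoc, int minSortOrdSeenTillNow, int numDocs) {
        int doc = fromDoc;
        while (doc < numDocs && blockMaxOrd[doc / blockSize] <= minSortOrdSeenTillNow) {
            doc = (doc / blockSize + 1) * blockSize; // jump to the next block
        }
        return Math.min(doc, numDocs);
    }

    /**
     * Collects the n largest ords into queue (a min-heap, so peek() is the
     * smallest ord kept so far) and returns how many docs were visited.
     */
    static int collectTopN(int[] ordByDoc, int blockSize, int n,
                           PriorityQueue<Integer> queue) {
        SortStatsSketch stats = new SortStatsSketch(ordByDoc, blockSize);
        int visited = 0;
        int doc = 0;
        while (doc < ordByDoc.length) {
            if (queue.size() == n) {
                // Queue is full: skip blocks that cannot beat its smallest ord.
                doc = stats.seek(doc, queue.peek(), ordByDoc.length);
                if (doc >= ordByDoc.length) {
                    break;
                }
            }
            visited++;
            int ord = ordByDoc[doc];
            if (queue.size() < n) {
                queue.add(ord);
            } else if (ord > queue.peek()) {
                queue.poll();
                queue.add(ord);
            }
            doc++;
        }
        return visited;
    }
}
```

Blocks whose max ord only ties the smallest ord in the queue are skipped as well, on the assumption that ties are broken in favour of the earlier doc ID.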
I understand this simplistic logic depends on the sort field's data
distribution and won't work for multi-field sorts, out-of-order scoring,
etc. But in general, is this a good idea to explore, or something that is
best not attempted?
Any help is much appreciated.
--
Ravi