Mailing List Archive

Question abount combining InvertedIndex and SortField
Assume i first use keyword search to get a DocIDSet from inverted index,
then i want to sort these docIds by some numeric field, like a
`updateTime`, does Lucene do this without need of loading the Document
objects but only with an sorted index on `updateTime`? Which i call it
"Index-Only Sort Optimization" (MUST be some equal concepts in RDBMS?)

And since Lucene has a `SortField` API, what does it do the sort? I thought
SortField is just a post-processing...
Re: Question abount combining InvertedIndex and SortField [ In reply to ]
Hello, ???.

On Tue, Dec 31, 2019 at 6:32 AM ??? <ctengctsh@gmail.com> wrote:

> Assume i first use keyword search to get a DocIDSet from inverted index,
> then i want to sort these docIds by some numeric field, like a
> `updateTime`, does Lucene do this without need of loading the Document
> objects but only with an sorted index on `updateTime`?

1. Lucene doesn't load Document objects from stored fields files while
sorting for sure.
2. Lucene uses dedicated columnar data structure (DocVaues made index time,
or in the worst case lazily loaded FieldCache) to obtain field values while
collecting search results from inverted index.
3. One deviation from this generic algorithm is sorted index and early
termination, that's probably what you meant in "Index-Only Search
Optimization".


> Which i call it
> "Index-Only Sort Optimization" (MUST be some equal concepts in RDBMS?)
>
> And since Lucene has a `SortField` API, what does it do the sort? I thought
>
It brings up TopFieldCollector instead of the default TopScoreDocCollector.

> SortField is just a post-processing...
>
Not really. Scoring/sorting should be done along side with searching to
reduce memory footprint by storing only top candidate results in a binary
heap.
IIRC it's described in this classic paper
http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf


--
Sincerely yours
Mikhail Khludnev
Re: Question abount combining InvertedIndex and SortField [ In reply to ]
Hi, Mikhail

Your words is very encouraging. I was thinking i might need to do another
Lucene custom query to apply my business-specific "index-only sort" and
"early-termination"... SortField API says can use any numeric/String field,
which is very perfect. (In this way, Lucene should be able to a high-perf
Top-N query if SortField can support dynamically-generated ranking
scores... not only the native indexed numeric/String fields)

Mikhail Khludnev <mkhl@apache.org> ?2019?12?31??? ??4:41???

> Hello, ???.
>
> On Tue, Dec 31, 2019 at 6:32 AM ??? <ctengctsh@gmail.com> wrote:
>
> > Assume i first use keyword search to get a DocIDSet from inverted index,
> > then i want to sort these docIds by some numeric field, like a
> > `updateTime`, does Lucene do this without need of loading the Document
> > objects but only with an sorted index on `updateTime`?
>
> 1. Lucene doesn't load Document objects from stored fields files while
> sorting for sure.
> 2. Lucene uses dedicated columnar data structure (DocVaues made index time,
> or in the worst case lazily loaded FieldCache) to obtain field values while
> collecting search results from inverted index.
> 3. One deviation from this generic algorithm is sorted index and early
> termination, that's probably what you meant in "Index-Only Search
> Optimization".
>
>
> > Which i call it
> > "Index-Only Sort Optimization" (MUST be some equal concepts in RDBMS?)
> >
> > And since Lucene has a `SortField` API, what does it do the sort? I
> thought
> >
> It brings up TopFieldCollector instead of the default TopScoreDocCollector.
>
> > SortField is just a post-processing...
> >
> Not really. Scoring/sorting should be done along side with searching to
> reduce memory footprint by storing only top candidate results in a binary
> heap.
> IIRC it's described in this classic paper
> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>