As part of getting started in writing a custom scorer using rich
positions, I'm trying to understand how queries are currently
processed internally. Here's the overview I've come up with:
[Archival note: The process below probably does not reflect how the
innards actually work. Consult the code or the real docs for
something authoritative.]
1. Query is parsed with KinoSearch::QueryParser.
   Produces Query objects, possibly including:
     KinoSearch::Search::BooleanQuery (logic joining subqueries)
     KinoSearch::Search::PhraseQuery (list of terms, in order)
     KinoSearch::Search::TermQuery (a query wrapped around a single term)
   TermQuery and PhraseQuery consist of Term objects as defined in
   KinoSearch::Index::Term.
     Terms are defined at the C level, and specify only a text string
     and a field.
     Phrase queries (as created by QueryParser) always have more than
     one term.
   It's possible to weight the importance of terms in a Boolean OR
   query, but not via QueryParser. This is done by calling
   set_boost() on the Query object (generally TermQuery or PhraseQuery)
   after it is created. This is only easy if you are constructing your
   Query object by some means other than calling QueryParser.
2. This query is passed to KinoSearch::Searcher->search().
   search() is defined in the KinoSearch::Searchable base class.
   search() creates a new KinoSearch::Search::Hits object, initializing it
   with the Query and the Searcher, along with an optional Filter and
   SortSpec.
     If a string is passed as a query, it is promoted to a Query
     object via QueryParser.
   $hits->seek() is called on the new query-specific Hits object.
     seek() checks whether it has the requested number ($num_wanted +
     $offset) of docs cached, and returns without doing anything if it
     does.
     Otherwise, seek() calls $searcher->top_docs(), requesting the top
     ($num_wanted + $offset) docs.
       The $offset docs are not skipped, but searched and scored again.
       The new top_docs results replace the existing shorter cache
       (if any).
   The new Hits object (with cached results) is returned.
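
To check my reading of the seek() caching in step 2, here is a minimal
sketch in Python rather than Perl. It is a toy model, not KinoSearch code:
the Hits class, the `searches` counter, and fake_top_docs() are all names
made up for illustration, and the real Hits object may well behave
differently.

```python
class Hits:
    """Toy model of the seek() caching described in step 2."""

    def __init__(self, top_docs):
        # top_docs: hypothetical callable taking n and returning the best
        # n (doc_num, score) pairs, best first.
        self._top_docs = top_docs
        self._cache = []
        self.searches = 0  # counts how often a full search was run

    def seek(self, offset, num_wanted):
        needed = offset + num_wanted
        if len(self._cache) < needed:
            # Not enough docs cached: search again for the top `needed`
            # docs. The first `offset` docs are re-scored, not skipped,
            # and the result replaces the shorter cache.
            self._cache = self._top_docs(needed)
            self.searches += 1
        return self._cache[offset:needed]


def fake_top_docs(n):
    # Stand-in for $searcher->top_docs(): ten docs, descending scores.
    ranked = [(doc_num, 1.0 / (doc_num + 1)) for doc_num in range(10)]
    return ranked[:n]


hits = Hits(fake_top_docs)
first_page = hits.seek(0, 3)    # runs a search for the top 3
second_page = hits.seek(3, 3)   # runs a new search for the top 6
```

If this model is right, paging deep into a result set repeats the scoring
work for every earlier page each time the cache has to grow.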
3. $searcher->top_docs() first creates a new HitCollector object.
     If a SortSpec is defined, it uses this to create a SortCollector.
     If no SortSpec is defined, it creates a default TopDocCollector.
   $searcher->collect() is called by $searcher->top_docs().
     The new collector stores the top docs.
   $searcher->collect() creates a discriminating wrapper around
   the collector if a Filter was specified in the call to
   $searcher->search().
     Filtered docs are excluded at the point of collection, not
     the point of search.
   The prune_factor (if given) determines a maximum number of hits per
   segment.
     Confused: does this presume that the top docs will be evenly
     spread over the segments?
   $searcher->create_weight() is called to provide a scratchpad
   for scoring the query.
     This in turn just calls $query->make_weight().
     These weights are defined within their associated Query packages,
     i.e., KinoSearch::Search::BooleanWeight is defined within
     BooleanQuery.pm.
     Each subquery gets its own parallel subweight.
   $weight->scorer() is called to create the scorer that will do
   the actual collecting.
     Each query has its corresponding scorer: BooleanQuery ->
     BooleanScorer.
     Confused: when does $sub_weight->scorer($reader) return
     undef (in BooleanQuery.pm)?
   Finally, $scorer->collect() is called by $searcher->collect().
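
The "discriminating wrapper" in step 3 can be pictured roughly as below.
Again this is a Python sketch under my own assumptions, not the actual
KinoSearch classes: ListCollector, FilteredCollector, and the set of
allowed doc numbers are hypothetical stand-ins for the real collector and
Filter.

```python
class ListCollector:
    """Hypothetical inner collector that just records every hit."""

    def __init__(self):
        self.hits = []

    def collect(self, doc_num, score):
        self.hits.append((doc_num, score))


class FilteredCollector:
    """Wrapper that forwards only hits the filter allows."""

    def __init__(self, inner, allowed_docs):
        self.inner = inner
        self.allowed = allowed_docs  # set of doc_nums passing the filter

    def collect(self, doc_num, score):
        # The doc has already been found and scored by the time we get
        # here; filtering only decides whether it is passed along.
        if doc_num in self.allowed:
            self.inner.collect(doc_num, score)


inner = ListCollector()
wrapped = FilteredCollector(inner, allowed_docs={2, 5})
for doc_num, score in [(1, 0.4), (2, 0.9), (5, 0.3), (8, 0.7)]:
    wrapped.collect(doc_num, score)
# inner.hits now holds only the hits for docs 2 and 5.
```

The point of the sketch is that a filtered-out doc still costs a full
scoring pass, since exclusion happens at collection time.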
4. $scorer->collect() (in Scorer.c) runs through each segment looking for
   hits.
   Scorer_Skip_To is called and does a 'delayed_init' of the scorer.
     The first time through, an 'inner_scorer' is assigned to
     $scorer->scorer.
     Subscorers are also created, one for each term in the BooleanQuery.
       Subscorers of types like ANDScorer and ORScorer are layered
       over TermScorer.
   Scorer_Next is called on the top-level Scorer.
     The subscorers skip_to doc_nums en masse until they find a
     doc they agree on.
       skip_to calls SegPList_read_bulk as necessary to
       read from the index.
   The scores of all the subqueries are tallied according to the
   Scorer-specific tallying.
     BooleanQuery makes use of precalculated coord_factors.
       Are these coord_factors just the precalculated ratios of
       terms found?
       Sim_prox_coord exists in Similarity.c for position boost,
       but is not used anywhere?
   $collector->collect() is called once per hit (not per doc_num,
   unless all docs are hits).
     TopDocCollector->collect() saves the doc if the score is
     greater than the minimum score in the current HitQueue.
   At this point the Collector has cached the top hits and they are
   returned by $hits->seek().
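
Two pieces of step 4 are easier for me to reason about in code: the
subscorers skipping ahead until they agree on a doc, and the collector
keeping only the top hits. The Python below is a sketch under my own
assumptions, not what Scorer.c does: intersect() models an AND over
sorted lists of doc numbers, and the TopDocCollector here is a made-up
stand-in that uses a min-heap in place of the HitQueue.

```python
import heapq
from bisect import bisect_left


def intersect(postings):
    """Yield doc_nums present in every sorted list of doc_nums."""
    if not postings or any(not plist for plist in postings):
        return
    positions = [0] * len(postings)
    target = max(plist[0] for plist in postings)
    while True:
        agreed = True
        for i, plist in enumerate(postings):
            # skip_to: advance this list to its first doc_num >= target.
            positions[i] = bisect_left(plist, target, positions[i])
            if positions[i] == len(plist):
                return  # one list is exhausted, so no more matches
            if plist[positions[i]] != target:
                target = plist[positions[i]]  # raise the bar, go again
                agreed = False
        if agreed:
            yield target
            target += 1


class TopDocCollector:
    """Keeps the `size` best hits in a min-heap keyed on score."""

    def __init__(self, size):
        self.size = size
        self.queue = []  # entries are (score, doc_num); lowest score first

    def collect(self, doc_num, score):
        if len(self.queue) < self.size:
            heapq.heappush(self.queue, (score, doc_num))
        elif score > self.queue[0][0]:
            # Better than the current minimum: replace the weakest hit.
            heapq.heapreplace(self.queue, (score, doc_num))

    def top_docs(self):
        ranked = sorted(self.queue, key=lambda e: (-e[0], e[1]))
        return [(doc_num, score) for score, doc_num in ranked]


matches = list(intersect([[1, 3, 5, 7], [3, 4, 7, 9], [0, 3, 7]]))
collector = TopDocCollector(size=2)
for doc_num, score in [(3, 0.5), (7, 0.9), (8, 0.1), (9, 0.7)]:
    collector.collect(doc_num, score)
```

The scores fed to the collector here are arbitrary numbers; how the real
scorers tally them (coord_factors and so on) is exactly the part I am
unsure about.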
You don't need to spend much time, but I'd appreciate if you could
fill me in on any major pieces I've missed. Currently it feels very
complex, yet I suspect I'm still missing something. I think I need to
understand this (and the index creation) better before I start
implementing.
Thanks!
Nathan Kurz
nate@verse.com