Mailing List Archive: the guts of a query

As part of getting started in writing a custom scorer using rich
positions, I'm trying to understand the how queries are currently
processed internally. Here's the overview I've come up with:

[.Archival note: The process below probably does not reflect how the
innards actually work. Consult the code or the real docs for
something authoritative.]

1. Query is parsed with KinoSearch::QueryParser.
Produces Query objects, possibly including:
KinoSearch::Search::BooleanQuery (logic joining subqueries)
KinoSearch::Search::PhraseQuery (list of terms which in order)
KinoSearch::Search::TermQuery (a query wrapped around a single term)

TermQuery and PhraseQuery consist of term objects as defined in:
KinoSearch::Index::Term
Terms are defined at the C level, and specify only a text string
and a field.

Phrase queries (as created by QueryParser) are always more than one term.

It's possible to weight the importance of terms in a Boolean OR
query, but not via QueryParser. This is done by calling
set_boost() on the Query object (generally TermQuery or PhraseQuery)
after it is created. This is only easy if you are constructing your
Query object by some means other than calling QueryParser.

2. This query is passed to Kinosearch::Searcher->search()
search() is defined in the Kinosearch::Searchable base class.

Search creates a new Kinosearch::Search::Hits object, initializing it
with the Query and the Searcher, along with an optional Filter and SortSpec.
If a string is passed as a query, it is promoted to a Query
object via QueryParser

$hits->seek() is called on the new Query specific Hits object.

seek() checks if it has the requested number ($num_wanted +
$offset) of docs cached,
and returns doing nothing if it does.

seek() then calls $searcher->top_docs() requesting the top
($num_wanted + $offset) docs.
The $offset docs are not skipped, but searched and scored again.
The new top_docs results replaces the existing shorter cache (if any).

The new Hits object (with cached results) is returned.

3. $searcher->top_docs() first creates a new HitCollector object.
If a SortSpec is defined, it uses this to create a SortCollector.
If no SortSpec is defined, it creates a default TopDocCollector.

$searcher->collect() is called by $searcher->top_docs()
The new collector stores the top docs.
$searcher->collect() creates a discriminating wrapper around
the collector
if a Filter was specified in the call to $searcher->search().
Filtered docs are excluded at the point of collection, not
the point of search.
The prune_factor (if given) determines a maximum hits per segment.
Confused: this presumes that the top docs will be evenly
spread over the segments?
$searcher->create_weight() is called to provide a scratchpad
for scoring the query.
This in turn just calls $query->make_weight().
These weights are defined within their associated Query packages
ie, KinoSearch::Search::BooleanWeight is defined within
BooleanQuery.pm
Each subquery gets its own parallel subweight.
$weight->scorer() is called to create the scorer that will do
the actual collecting.
Each query has its corresponding scorer: BooleanQuery ->
BooleanScorer
Confused: When does $sub_weight->scorer($reader) return
undef (in BooleanQuery.pm)
Finally, $scorer->collect() is called by $searcher->collect().

4. $scorer->collect() (in Scorer.c) runs through each segment looking for hits.
Scorer_Skip_To is called an does a 'delayed_init' of the scorer
The first time through, an 'inner_scorer' assigned to
$scorer->scorer.
Subscorers are also created, one for each term in the BooleanQuery.
Subscorers of types like ANDScorer and ORScorer layered
over TermScorer
Scorer_Next is called on the top level Scorer
The subscorers skip_to doc_nums en masse until they find a
doc they agree on
The skip_to calls SegPList_read_bulk as necessary to
read from the Index.
The scores of all the subqueries are tallied according to the
Scorer specific tallying.
BooleanQuery makes use of pre-calculated coord_factors
These coord_factor are just the precalculated fractions of
the ratio of terms found?
Sim_prox_coord exists in Similiarity.c for position boost
but is not used anywhere?
$collector->collect() is called once per hit (not per doc_num
unless all hits)
TopDocCollector->collect() saves the doc if the score is
greater than the minimum
score in the current HitQueue.
At this point the Collector has cached the top hits and they are
returned by $hits->seek()

You don't need to spend much time, but I'd appreciate if you could
fill me in on any major pieces I've missed. Currently, it feels very
complex but like I still must be missing something. I think I need to
understand this (and the index creation) better before I start
implementing.

Thanks!

Nathan Kurz
nate@verse.com

Nathan,

Congrats on your tenacity. That's a lot of code spelunking!

I think what we need is high-level outline.

There are three main phases during a search query: compile, score,
and fetch.

* Compile:
o Parse QueryString or build Query object manually.
o Weight query.
o Build Scorer.
* Score: Iterate over matches, collecting hits into HitCollector.
* Fetch: retrieve stored documents (optional).

The first phase is the most complex. The last is the simplest.

The "Weight" part of the compilation phase is where a lot of IR
theory gets applied, and is probably the hardest thing to understand.

On Jun 13, 2007, at 12:49 AM, Nathan Kurz wrote:

[.note: I've reformatted some of your quotes as best I could, since
they got mangled]

> 1. Query is parsed with KinoSearch::QueryParser.
> Produces Query objects, possibly including:
> KinoSearch::Search::BooleanQuery (logic joining subqueries)
> KinoSearch::Search::PhraseQuery (list of terms which in order)
> KinoSearch::Search::TermQuery (a query wrapped around a
> single term)

QueryParser could conceivably spit out an entirely different Query
object. KinoSearch's QueryParser is designed to use classic web
search syntax. But you could, for example, ignore quotes and just
make every word a term, and combine everything into one big bag o'
TermQueries.

Nothing's stopping you from making up your own syntax and writing
your own QueryParser subclasses. The only constraint is that the end
product must be an object which isa KinoSearch::Search::Query which
you can feed to Searcher->search.

> Phrase queries (as created by QueryParser) are always more than
> one term.

It would be OK to have a PhraseQuery with only one term, because it
would produce identical results. However, both QueryParser and
PhraseWeight will detect a single-term phrase and optimize it into a
TermQuery/TermScorer.

> It's possible to weight the importance of terms in a Boolean OR
> query, but not via QueryParser. This is done by calling
> set_boost() on the Query object (generally TermQuery or
> PhraseQuery)
> after it is created. This is only easy if you are constructing
> your
> Query object by some means other than calling QueryParser.

set_boost() is a method common to all Query subclasses. You're right
that it's not accessible via QueryParser.

FWIW, Lucene's QueryParser uses a much more complicated syntax than
KinoSearch. Among other things, you can set boost like so:

foo^2 bar^5 "fee fie fo fum"^20

I chose not to support that construct, but someone could write a
custom QueryParser that recognized it.

>
> 2. This query is passed to Kinosearch::Searcher->search()
> search() is defined in the Kinosearch::Searchable base class.
>
> Search creates a new Kinosearch::Search::Hits object,
> initializing it
> with the Query and the Searcher, along with an optional Filter
> and SortSpec.
> If a string is passed as a query, it is promoted to a Query
> object via QueryParser
>
> $hits->seek() is called on the new Query specific Hits object.
>
> seek() checks if it has the requested number ($num_wanted +
> $offset) of docs cached,
> and returns doing nothing if it does.
>
> seek() then calls $searcher->top_docs() requesting the top
> ($num_wanted + $offset) docs.
> The $offset docs are not skipped, but searched and scored again.
> The new top_docs results replaces the existing shorter cache
> (if any).
>
> The new Hits object (with cached results) is returned.

This somewhat convoluted sequence of events is more of an
implementation detail than anything else. The only reason that the
Hits object really needs to get involved is that Hits->seek is part
of the public API and calling that may require reprocessing the
search. If it weren't for that, we could run everything inside the
Searcher and build a Hits object at the very end, which would be more
intuitive.

> 3. $searcher->top_docs() first creates a new HitCollector object.
> If a SortSpec is defined, it uses this to create a
> SortCollector.
> If no SortSpec is defined, it creates a default TopDocCollector.
>
> $searcher->collect() is called by $searcher->top_docs()
> The new collector stores the top docs.
> $searcher->collect() creates a discriminating wrapper around
> the collector
> if a Filter was specified in the call to $searcher->search().
> Filtered docs are excluded at the point of collection, not
> the point of search.

The important thing to recognize here is that you're going to be
iterating over a bunch of doc_num/score pairs during Scorer-
>collect. What kind of HitCollector you use determines what you do
with those.

> The prune_factor (if given) determines a maximum hits per
> segment.
> Confused: this presumes that the top docs will be evenly
> spread over the segments?

Ignore prune_factor for now. It never affect things unless you
specifically enable it.

> $searcher->create_weight() is called to provide a scratchpad
> for scoring the query.
> This in turn just calls $query->make_weight().
> These weights are defined within their associated Query
> packages
> ie, KinoSearch::Search::BooleanWeight is defined within
> BooleanQuery.pm
> Each subquery gets its own parallel subweight.

The Weight subclasses were, I believe, a later addition to Lucene.
They are there so that you don't have to assign index-specific
information to Query objects, facilitating reuse.

> Confused: When does $sub_weight->scorer($reader) return undef (in
> BooleanQuery.pm)

$weight->scorer($reader) can return undef rather than a Scorer if we
know for certain that the Scorer will not match any docs. Say you
have a PhraseQuery for "Big Bird's best friend is Snuffleupagus" --
if 'Snuffleupagus' isn't in the index represented by $reader, then
the whole phrase can never match.

This is an example of why it's handy to separate concerns via Query/
Weight/Scorer. The same PhraseQuery object can be used with multiple
indexes. Indexes with and without 'Snuffleupagus' in them will
process the same PhraseQuery and compile it to something else.

> Finally, $scorer->collect() is called by $searcher->collect().
>
> 4. $scorer->collect() (in Scorer.c) runs through each segment
> looking for hits.
> Scorer_Skip_To is called an does a 'delayed_init' of the scorer
> The first time through, an 'inner_scorer' assigned to
> $scorer->scorer.
> Subscorers are also created, one for each term in the
> BooleanQuery.
> Subscorers of types like ANDScorer and ORScorer layered
> over TermScorer
> Scorer_Next is called on the top level Scorer
> The subscorers skip_to doc_nums en masse until they find a
> doc they agree on
> The skip_to calls SegPList_read_bulk as necessary to
> read from the Index.
> The scores of all the subqueries are tallied according to the
> Scorer specific tallying.
> BooleanQuery makes use of pre-calculated coord_factors
> These coord_factor are just the precalculated fractions
> of the ratio of terms found?

The coord is there to give an "AND-like" boost to documents which
match several clauses of a boolean query. It's a minor factor in the
grand scheme of things.

Rather than trying to grok BooleanScorer right off the bat, I suggest
you try to acquire a level of comfort following the progress of a
TermQuery.

> Sim_prox_coord exists in Similiarity.c for position boost
> but is not used anywhere?

That's right, it remains unimplemented.

> $collector->collect() is called once per hit (not per doc_num
> unless all hits)
> TopDocCollector->collect() saves the doc if the score is
> greater than the minimum score in the current HitQueue.
> At this point the Collector has cached the top hits and they are
> returned by $hits->seek()

A point of clarification: the collector only deals with doc_num/score
pairs. The actual stored documents don't get fetched until you start
calling Hits->fetch_hit_hashref.

HTH,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/