Mailing List Archive: Re: Query-Weight-Scorer hierarchy

On 1/30/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> Query objects represent an abstract ideal. They don't dirty their
> hands with actual real-world index data.
>
> Query: A pure, abstract representation of a logical query.
> Weight: Applies a query to a particular collection of documents.
> Scorer: Applies a query to individual documents.
>
> This is an area that's ripe for refactoring. I've already pulled out
> a bunch of cruft that was inherited from Lucene. Why don't we see
> what we come up with if we go back to first principles? I think the
> division of labor in the Query-Weight-Scorer hierarchy described
> above is sound. Do you agree?

I apologize for the delay on getting back to you on this. This
strikes me as a very important discussion to have, and I'd love to
have a wider discussion about it. When I was spending lots of time
with KinoSearch, my poor understanding of how these pieces interrelate
really hampered me. I'm sure some of this was due to my incorrect
preconceptions, but I also think that clearer and simpler architecture
here would help the project a lot.

Rather than discussing whether the view you offer is sound, let me
start by offering my naive stick figure drawing of how I picture
search working:

----

A Query is produced by a Parser: a user enters text, and this text is
compiled into a Query. A Query can be composed of sub-Queries, which
could be run as Queries in their own right. At the C level, Queries
have a public data API, which is filled in by Index specific Posting.
The tree structure of the Query is also public, so that Scorers can
walk this tree and run sub-Scorers attached to sub-Queries.

When the Query is run against an Index, it is initialized top-down.
The bottom level TermQueries are associated with Postings specific to
the Index format being used, and Scorers associated with each Query
are initialized. Then there is a loop, where each Query is advanced
until it finds a matching document. The Queries propagate the advance
downward, ratcheting forward until satisfied with its children's
states.

When the topmost Query has found a matching document, a Scorer
associated with the top Query is run. This Scorer returns a numeric
score, but can generate this number however it chooses. A simple
scorer (on a TermQuery) might simply return the weight it was passed
at initialization. A more complex Scorer (for an OrQuery) might run
the sub-Scorers on each of the children of its associated Query and
return the result of its highest sub-Scoring child.

But while a Scorer might be a default for given type of Query, it is
not intimately associated with it. A PhraseQuery might choose to use
a simple WeightScorer and the sub-Scorers associated with the Terms
would never be called. Certain Scorers might only make sense with
queries that have children, but this would be enforced by typechecks
at initialization.

The score is passed to a Collector along with the current filled in
Query. The Collector keeps a record of the scores it wants (top N,
top N with an offset, top N below a set score, all scores above a
minimum, whatever) and discards the rest. If it wants, it can cache
other information associated with the Query to be returned with the
Results. After there are no more matching documents, some 'finalize'
method is called on the Collector, allowing it to tidy up and return
its Results.

----

I'm sure there in contradictions in my description, and probably some
things that would be impossible to implement exactly as described.
But this is basically the architecture that makes sense to me
(although it's true I'm thinking a lot more about ice cream right now
than search). The current KinoSearch architecture --- even though I
put a lot of time into spelunking it --- never really sunk in. In
particular, I never figured out why Weights existed.

Which is to say that I don't think the current division of labor is
that sound, and that I'd love to help you come up with a basic
architecture that is simpler and more flexible. Unfortunately, I'm
probably too far out of KinoSearch mode right now to be of much use
other than as a sounding board. But I'll do all I can to encourage
you to keep thinking about the problem from first principles, since I
think you are bound to come up with something I like better than the
current Lucene based architecture!

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch