
Query-Weight-Scorer hierarchy (was Re: Wildcards)
On Jan 29, 2008, at 8:06 PM, Nathan Kurz wrote:

> What I meant to say was that
> the globals information doesn't need to be known by the query, only by
> the Scorer.

By "globals information", I presume you mean the IDF. IDF is needed
by the Weight.

> The Query would deal with only the per-document data.

That confuses me.

Query objects represent an abstract ideal. They don't dirty their
hands with actual real-world index data.

Query: A pure, abstract representation of a logical query.
Weight: Applies a query to a particular collection of documents.
Scorer: Applies a query to individual documents.

So the Weight deals with the per-collection information, and it's the
Scorer -- not the Query -- that deals with per-document data.
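
To make the division of labor concrete, here is a minimal sketch in Python rather than in KinoSearch's own code. The class and method names (make_weight, make_scorer, score), the toy DOCS index, and the particular IDF and TF formulas are all assumptions for illustration, not the actual API.

```python
import math
from dataclasses import dataclass

# Hypothetical toy collection: doc id -> list of tokens.
DOCS = {
    0: ["apple", "banana", "apple"],
    1: ["banana", "cherry"],
    2: ["cherry", "cherry", "date"],
}


@dataclass(frozen=True)
class TermQuery:
    """Query: a pure description of what is being asked. No index data."""
    term: str

    def make_weight(self, docs):
        return TermWeight(self, docs)


class TermWeight:
    """Weight: binds a Query to one collection; per-collection stats live here."""

    def __init__(self, query, docs):
        self.query = query
        self.docs = docs
        doc_freq = sum(1 for tokens in docs.values() if query.term in tokens)
        # One common IDF variant; the exact formula is an assumption.
        self.idf = 1.0 + math.log(len(docs) / (doc_freq + 1))

    def make_scorer(self):
        return TermScorer(self)


class TermScorer:
    """Scorer: applies the weighted query to one document at a time."""

    def __init__(self, weight):
        self.weight = weight

    def score(self, doc_id):
        # Per-document data (term frequency) is only touched here.
        tf = self.weight.docs[doc_id].count(self.weight.query.term)
        return math.sqrt(tf) * self.weight.idf


scorer = TermQuery("cherry").make_weight(DOCS).make_scorer()
scores = {doc_id: scorer.score(doc_id) for doc_id in DOCS}
```

Note that TermQuery never sees DOCS until it is asked to produce a Weight, and the IDF is computed once per collection rather than once per document.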

This actually has implications for generating HighlightSpan objects.
I've been saying that we should go back to the Query for that, but
really, Query objects won't know what to do with an individual
document. We'll have to compile the Query to a Weight, then the
Weight to a Scorer, and have the Scorer perform that task.
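
As a rough illustration of why the Scorer is the natural place for this, here is a Python sketch. The shape of HighlightSpan (character offsets plus a weight), the highlight_spans method, and the tokenize_with_offsets helper are all hypothetical; nothing here reflects an existing interface.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HighlightSpan:
    """Hypothetical shape: character offsets into the field text, plus a weight."""
    start: int
    end: int
    weight: float


def tokenize_with_offsets(text):
    """Hypothetical helper: lowercase word tokens with their character offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


class TermScorer:
    """Only the Scorer has per-document data, so only it can locate matches."""

    def __init__(self, term, idf, doc_texts):
        self.term = term
        self.idf = idf              # handed down from the Weight
        self.doc_texts = doc_texts  # doc id -> field text

    def highlight_spans(self, doc_id):
        tokens = tokenize_with_offsets(self.doc_texts[doc_id])
        return [HighlightSpan(start, end, self.idf)
                for token, start, end in tokens if token == self.term]


scorer = TermScorer("fish", idf=1.5, doc_texts={7: "red fish blue fish"})
spans = scorer.highlight_spans(7)
```

The Query contributes only the term and the Weight only the IDF; the offsets come from the document itself, which is exactly the data a Query never touches.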

>> Or maybe the default TermQuery class can do flat scoring and
>> TFIDFTermQuery would override? I imagine that would make you
>> happy. ;)
>
> Given the smileys, I'm not sure if this is a joke or not. To be
> clear, this solution would make me ill.

Heh. No, I was serious.

> My desire is to separate the
> query from the scoring, so having a different Query class for each
> possible scoring option is the antithesis of what I want. What I want
> is to have a number of independent Scorers that can be plugged into a
> Scorer-agnostic set of Queries: simple Queries, simple Scorers,
> complex combinations.

That's an interesting vision. It's somewhat at odds with how things
currently work, because the expectation is that a FooQuery will be
associated with a FooWeight and a FooScorer.

However, BooleanScorers are aggregates of many other Scorers, and a
PhraseWeight will actually kick out a TermScorer if you only give it
one term. Plus, the association of a field with particular Posting
and Similarity subclasses affects how scores are calculated.
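
The PhraseWeight case shows that the one-to-one FooQuery/FooWeight/FooScorer mapping is already loose. A minimal Python sketch of that behavior follows; the make_scorer and matches method names are assumptions, and scoring is reduced to a boolean match to keep it short.

```python
class TermScorer:
    """Matches documents containing a single term."""

    def __init__(self, term):
        self.term = term

    def matches(self, tokens):
        return self.term in tokens


class PhraseScorer:
    """Matches documents containing the terms as a contiguous sequence."""

    def __init__(self, terms):
        self.terms = list(terms)

    def matches(self, tokens):
        n = len(self.terms)
        return any(tokens[i:i + n] == self.terms
                   for i in range(len(tokens) - n + 1))


class PhraseWeight:
    """The Weight decides which Scorer to hand back, not the Query."""

    def __init__(self, terms):
        self.terms = list(terms)

    def make_scorer(self):
        # A one-term "phrase" is just a term match, so use the cheaper Scorer.
        if len(self.terms) == 1:
            return TermScorer(self.terms[0])
        return PhraseScorer(self.terms)


single = PhraseWeight(["fish"]).make_scorer()
phrase = PhraseWeight(["blue", "fish"]).make_scorer()
```

Since the Weight is already free to choose among Scorer classes here, plugging in alternative Scorers may be less of a stretch than the naming convention suggests.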

This is an area that's ripe for refactoring. I've already pulled out
a bunch of cruft that was inherited from Lucene. Why don't we see
what we come up with if we go back to first principles? I think the
division of labor in the Query-Weight-Scorer hierarchy described
above is sound. Do you agree?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch