On 7/3/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> > no Searcher,
>
> Hmm. I don't see the advantage in that.
You certainly may be right. And to reiterate, I'm not suggesting that
these things would be good for KinoSearch as a whole, only for my
particular purposes. After I've got something that works we (you) can
consider adopting any (or none) of them. I'm playing blue-sky here.
It's not that I see a great advantage in getting rid of Searcher, only
that I want to flatten the hierarchy. I think it is of conceptual
benefit, but we'll see if the code comes out OK. I like that it
exposes the Scorers better.
> > no Weight objects,
>
> Note that you have to modify the Query itself. If you try to wait
> until creating the Scorer to perform the weighting, you run into
> problems with MultiSearcher, specifically with the calculation of
> IDF. The MultiSearcher object knows how common a term is across all
> the sub-indexes *combined*. That's information that each individual
> IndexReader doesn't have access to.
Wow, so you are making multiple round trips to the search servers? I
was guessing that a local approximation would suffice, so that each
server would be independent. I guess that's the only way to get truly
accurate weights, but I wonder if the increased precision is worth the
cost.
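
To make the local-versus-global question concrete, here is a small sketch that compares a per-server IDF against one computed from combined counts. It assumes a Lucene-style formula, idf = 1 + ln(num_docs / (doc_freq + 1)); the shard counts are invented for illustration and are not measurements from any real index.

```python
import math

def idf(doc_freq, num_docs):
    # Lucene-style inverse document frequency (an assumption for this sketch).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: (docs containing the term, total docs on that server).
shards = [(5, 1000), (400, 1000), (45, 1000)]

# Local approximation: each server only sees its own counts.
local_idfs = [idf(df, n) for df, n in shards]

# Global value: what a MultiSearcher-like coordinator could compute
# after collecting doc frequencies from every server.
global_df = sum(df for df, _ in shards)
global_n = sum(n for _, n in shards)
global_idf = idf(global_df, global_n)

for (df, n), local in zip(shards, local_idfs):
    print(f"df={df:4d} of {n}: local idf {local:.3f}, global idf {global_idf:.3f}")
```

When the term is spread unevenly across servers, the local values diverge from the global one, so identical documents would score differently depending on where they live. Whether that matters in practice depends on how evenly the documents are distributed.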
I don't see that this means I have to modify the Query, though. I've
thought a little bit about distributed search, but only for my
specific non-weighted case. I've been presuming (for my purposes) a
parallel set of HTTP connections to the server farm, each returning N
top hits, from which the top N are selected.
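
A minimal sketch of that merge step, assuming each server has already returned its own top N as a list of (score, doc_id) pairs. The function name and the hit format are hypothetical, and the parallel HTTP fetching is left out entirely.

```python
import heapq

def merge_top_hits(per_server_hits, n):
    """Pick the n best (score, doc_id) pairs from several servers' top-n lists."""
    all_hits = (hit for hits in per_server_hits for hit in hits)
    # nlargest returns the winners sorted from highest to lowest score.
    return heapq.nlargest(n, all_hits, key=lambda hit: hit[0])

# Hypothetical responses from three servers, each already limited to N=2.
per_server = [
    [(0.9, "a1"), (0.4, "a2")],
    [(0.8, "b1"), (0.7, "b2")],
    [(0.5, "c1"), (0.1, "c2")],
]
print(merge_top_hits(per_server, 2))
```

This only gives a meaningful ranking if scores from different servers are comparable, which is easy in the non-weighted case and is exactly where the IDF question comes in for tf/idf.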
I'll think more.
> > no Similarity object,
>
> I know that length normalization doesn't matter for your application,
> but it's key for standard tf/idf.
Roger. As I said, I'm optimizing for my particular needs first. I do
like that this data is well encapsulated. Rather than saying 'no
Similarity object', perhaps it would be better to say that the
Similarity object becomes scorer specific: each set of cooperating
scorers shares a ScorerData object, which according to the needs of
that scoring system may or may not look like the current Similarity
object.
> > no delayed inits,
>
> Any of these that can go away, I say good riddance. They are usually
> artifacts of a simplified external API, though. The delayed init
> spares the caller from the responsibility of invoking some init
> routine manually before a loop begins.
Yes, in particular it's because of the incremental formation of
BooleanScorer. Rather than requiring an init call, I'm thinking that
we can just require that all the clauses be passed to the constructor
like the subscorers currently do.
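
Roughly what that could look like, as a sketch only: every clause is handed to the constructor, so the object is complete once it exists and no separate init step is needed before the scoring loop. The class name, the clause format, and the occur labels are all made up for illustration.

```python
class BooleanScorerSketch:
    """Hypothetical scorer that receives all of its clauses up front."""

    def __init__(self, clauses):
        # clauses: iterable of (occur, subscorer) pairs, where occur is one of
        # "MUST", "SHOULD", or "MUST_NOT" (labels assumed for this sketch).
        clauses = list(clauses)
        for occur, _ in clauses:
            if occur not in ("MUST", "SHOULD", "MUST_NOT"):
                raise ValueError(f"unknown occur value: {occur!r}")
        # Everything is sorted into place here; nothing is added later.
        self.required = tuple(s for occur, s in clauses if occur == "MUST")
        self.optional = tuple(s for occur, s in clauses if occur == "SHOULD")
        self.prohibited = tuple(s for occur, s in clauses if occur == "MUST_NOT")

# Strings stand in for real subscorers.
scorer = BooleanScorerSketch([("MUST", "foo"), ("SHOULD", "bar"), ("MUST_NOT", "baz")])
print(scorer.required, scorer.optional, scorer.prohibited)
```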
> What optimizations have you found?
Nothing major, and nothing tested. I think there is a small gain by
having Scorer_Advance return a doc number directly rather than a
boolean, obviating the need for a follow-up call to Scorer_Doc. I
think I've done a slightly better job of caching the current document
positions of the subscorers.
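
A sketch of the idea in Python rather than C: advance() returns the document number it landed on (or a sentinel when exhausted), and the caller keeps each subscorer's current document in a local variable instead of asking for it again. The class, the sentinel, and the in-memory doc lists are hypothetical stand-ins, not the KinoSearch API.

```python
import bisect

NO_MORE_DOCS = -1  # hypothetical sentinel for an exhausted scorer

class TermScorerSketch:
    def __init__(self, doc_ids):
        self._doc_ids = sorted(doc_ids)
        self._pos = 0

    def advance(self, target):
        """Move to the first doc >= target and return its number."""
        self._pos = bisect.bisect_left(self._doc_ids, target, self._pos)
        if self._pos < len(self._doc_ids):
            return self._doc_ids[self._pos]
        return NO_MORE_DOCS

def intersect(a, b):
    """Collect docs matched by both scorers using only advance() results."""
    matches = []
    doc_a = a.advance(0)  # cached current positions of the subscorers
    doc_b = b.advance(0)
    while doc_a != NO_MORE_DOCS and doc_b != NO_MORE_DOCS:
        if doc_a == doc_b:
            matches.append(doc_a)
            doc_a = a.advance(doc_a + 1)
            doc_b = b.advance(doc_b + 1)
        elif doc_a < doc_b:
            doc_a = a.advance(doc_b)
        else:
            doc_b = b.advance(doc_a)
    return matches

print(intersect(TermScorerSketch([1, 3, 5, 8, 13]),
                TermScorerSketch([2, 3, 8, 9, 13, 21])))
```

The saving is one call per step; whether that is measurable in C is untested.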
But some of this isn't even compiled yet, much less benchmarked. And
I'm sure I've missed some of the more subtle optimizations you've
made, as well as royally messed up in places. Once I have something
working correctly, I'll look more closely at everything you're doing
and compare the two.
> A thought: what if we did away with BooleanQuery, replacing it with
> ANDQuery, ORQuery, etc? I dunno that we want to open that can o'
> worms when 0.20 is so close, though.
Yes, that is the direction I was thinking. Upon further
reflection, though, I'm not sure it's a good idea. I like bringing
the subscorers more into the light, but for the Proximity scoring I'm
doing I actually need a top level scorer in a position similar to that
which BooleanScorer is in now.
> > 2) The current ORScorer calls Tally on its subscorers at the same time
> > it is skipping through documents, rather than at the end of the phase.
> > Is this a good practice that I should emulate?
>
> That's true. However it's quite efficient for this:
>
> ((expensive-phrase OR expensive-phrase) AND rare-term).
You are right. A better worst-case example would have been:
(expensive-tally OR expensive-tally) AND (expensive-tally OR expensive-tally)
Assume the pessimal case where the first and-clause matches only odd
documents, and the second and-clause matches only even. In the
current code, we'd still be performing a lot of expensive Tally calls even
though in theory we don't need to perform any.
I'm still not sure if this makes a difference, though, and whether or
not I should try to keep Tally clearly distinct from, and after, Advance. So
long as finding a match is more expensive than tallying it, it's not
going to make much of a difference.
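
A toy model of that pessimal case, counting Tally calls under both strategies. Each and-clause is collapsed into a single scorer that stands in for an ORScorer, and the names and the eager/lazy switch are invented for this sketch; it models only the call counts, not real scoring cost.

```python
class CountingScorer:
    """Scorer over a fixed doc list that counts how often tally() runs."""

    def __init__(self, doc_ids):
        self.doc_ids = list(doc_ids)
        self.pos = -1
        self.tally_calls = 0

    def next(self):
        self.pos += 1
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def tally(self):
        self.tally_calls += 1  # stands in for an expensive scoring step
        return 1.0

def step(scorer, eager):
    doc = scorer.next()
    if eager and doc is not None:
        scorer.tally()  # eager: tally as soon as the scorer lands on a doc
    return doc

def and_scan(a, b, eager):
    """Intersect two scorers, tallying eagerly or only on confirmed matches."""
    matches = []
    doc_a, doc_b = step(a, eager), step(b, eager)
    while doc_a is not None and doc_b is not None:
        if doc_a == doc_b:
            if not eager:
                a.tally()  # lazy: tally only once both sides agree
                b.tally()
            matches.append(doc_a)
            doc_a, doc_b = step(a, eager), step(b, eager)
        elif doc_a < doc_b:
            doc_a = step(a, eager)
        else:
            doc_b = step(b, eager)
    return matches

def count_tallies(eager):
    odds = CountingScorer(range(1, 100, 2))   # first and-clause: odd docs only
    evens = CountingScorer(range(0, 100, 2))  # second and-clause: even docs only
    matches = and_scan(odds, evens, eager)
    return matches, odds.tally_calls + evens.tally_calls

print("eager:", count_tallies(True))
print("lazy: ", count_tallies(False))
```

With no overlapping documents, the eager version tallies every candidate and the lazy version tallies nothing, which is the gap described above. How much it costs in practice still depends on how Tally compares to the matching work.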
> > ps. I like the direction of KinoSearch::Simple, particularly the
> > integration of the indexing and searching. I'm tempted to think that
> > rather than calling it 'Simple', you should just call it 'KinoSearch'
> > and eventually have it be the main API.
>
> I disagree. Searching and indexing are completely distinct tasks.
I appreciate your disagreement, although I think that there isn't much
danger of God objects if using a Has-A
(http://en.wikipedia.org/wiki/Has-a) approach like Hans is doing.
In my mind, there is an Index, and on this Index one can perform
operations. I was suggesting that internally searching and indexing
would still be distinct, but that these operations would be accessible
through a single composite (not multiply-inherited) KinoSearch object.
Going back to the top of this message, I think I'd like to see that
Index/KinoSearch object replace the current Searcher object as a
convenience composite object. But this is a mental model question:
internally, I too want them to be clearly separated.
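
As a sketch of the mental model only: a composite object that holds an indexer and a searcher and forwards calls to them, without inheriting from either. All class and method names here are hypothetical, and the "index" is just a shared Python list.

```python
class IndexerSketch:
    """Write side: only knows how to add documents."""

    def __init__(self, docs):
        self._docs = docs

    def add_doc(self, text):
        self._docs.append(text)

class SearcherSketch:
    """Read side: only knows how to look things up."""

    def __init__(self, docs):
        self._docs = docs

    def search(self, term):
        return [i for i, text in enumerate(self._docs) if term in text.split()]

class IndexSketch:
    """Composite: has an indexer and has a searcher, and is neither."""

    def __init__(self):
        docs = []  # stand-in for the on-disk index both sides operate on
        self._indexer = IndexerSketch(docs)
        self._searcher = SearcherSketch(docs)

    def add_doc(self, text):
        self._indexer.add_doc(text)

    def search(self, term):
        return self._searcher.search(term)

index = IndexSketch()
index.add_doc("blue sky")
index.add_doc("green grass")
print(index.search("sky"))
```

The composite stays thin because each method is a one-line delegation, so the indexing and searching code can remain fully separate behind it.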
Nathan Kurz
nate@verse.com