On 7/2/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> do {
> ...
> } while (Scorer_Skip_To( self, BitVec_Next_Set_Bit(filter_bits) ));
I was thinking more along the lines of something that would fit into
the search infrastructure. One could create a FilterScorer that would
wrap an existing scorer. Something like (obviously not exact):
$boolean_scorer = $query->scorer;
$filter_scorer = new FilterScorer($boolean_scorer, @filters);
$filter_scorer->collect();
While advancing through documents, FilterScorer works like an
AndScorer, but for scoring it only calls Tally on the internal
BooleanScorer. If one wanted, the actual filters could be
BitVectorScorers that would use a cached BitVector rather than reading
in Postings.
> The downside is that it takes the filtering logic out of HitCollector
> and puts it in Scorer. HitCollector is sparse; Scorer has a lot more
> going on.
(and elsewhere you wrote:
> Analogously, it should be possible to create a suite of Queries/
> Scorers which are optimized for matching alone.
This got me thinking: how would a Scorer optimized for matching alone
be different? The main difference is that wouldn't have any need for
Tally to be called. I think the only problem with using the existing
Scorers for matching is that they do a lot of unnecessary Tallying if
the score is thrown away.
So maybe rather than having Tally called as part of the
$scorer->collect() loop, it could be optionally called by the
HitCollector. A TopDocCollector would call Tally and use the score to
order hits, but a SortCollector wouldn't have to. HitCollector would
be passed a Scorer instead of just a score.
This was a late night idea, but it still seems to make sense to me this morning.
Does it make sense to you?
> PostingList objects are iterators that refill from disk; BitVectors
> are memory only. Few objects refilling from disk means fewer disk
> seeks means better performance.
My intuition is that BitVectors are going to be a win only in very
specific situations. If you can fit your entire index in RAM, the
system page cache is going to treat you well after the first access,
no BitVector needed. If your index is _much_ larger than RAM, the
BitVector is going to be unwieldily large too. If your index is a
size comparable to RAM, even at a 1:32 ratio your (multiple) cached
BitVectors are going to be displacing index that would otherwise have
been cached.
I wonder if one would get more bang by optimizing the index structure
(as you are already doing) than by resorting to BitVectors. My
personal impulse is to figure out how the indexes could be mmap'ed in
and used directly, thus saving the processing time in reconstituting
them and allowing for better shared resources across multiple search
threads/processes. My guess is that this could be a win over even the
BitVectors.
> However, BitVectors don't work well remotely, since they're too big
> to pass around -- that's why SearchClient doesn't support filtering.
This does strike me as a major disadvantage. Almost everything else
I've looked at in KinoSearch seems like it scales quite nicely to
multiple machines.
It would be nice to enable someone to use this for a new Google :).
Nathan Kurz
nate@verse.com
> do {
> ...
> } while (Scorer_Skip_To( self, BitVec_Next_Set_Bit(filter_bits) ));
I was thinking more along the lines of something that would fit into
the search infrastructure. One could create a FilterScorer that would
wrap an existing scorer. Something like (obviously not exact):
$boolean_scorer = $query->scorer;
$filter_scorer = new FilterScorer($boolean_scorer, @filters);
$filter_scorer->collect();
While advancing through documents, FilterScorer works like an
AndScorer, but for scoring it only calls Tally on the internal
BooleanScorer. If one wanted, the actual filters could be
BitVectorScorers that would use a cached BitVector rather than reading
in Postings.
> The downside is that it takes the filtering logic out of HitCollector
> and puts it in Scorer. HitCollector is sparse; Scorer has a lot more
> going on.
(and elsewhere you wrote:
> Analogously, it should be possible to create a suite of Queries/
> Scorers which are optimized for matching alone.
This got me thinking: how would a Scorer optimized for matching alone
be different? The main difference is that wouldn't have any need for
Tally to be called. I think the only problem with using the existing
Scorers for matching is that they do a lot of unnecessary Tallying if
the score is thrown away.
So maybe rather than having Tally called as part of the
$scorer->collect() loop, it could be optionally called by the
HitCollector. A TopDocCollector would call Tally and use the score to
order hits, but a SortCollector wouldn't have to. HitCollector would
be passed a Scorer instead of just a score.
This was a late night idea, but it still seems to make sense to me this morning.
Does it make sense to you?
> PostingList objects are iterators that refill from disk; BitVectors
> are memory only. Few objects refilling from disk means fewer disk
> seeks means better performance.
My intuition is that BitVectors are going to be a win only in very
specific situations. If you can fit your entire index in RAM, the
system page cache is going to treat you well after the first access,
no BitVector needed. If your index is _much_ larger than RAM, the
BitVector is going to be unwieldily large too. If your index is a
size comparable to RAM, even at a 1:32 ratio your (multiple) cached
BitVectors are going to be displacing index that would otherwise have
been cached.
I wonder if one would get more bang by optimizing the index structure
(as you are already doing) than by resorting to BitVectors. My
personal impulse is to figure out how the indexes could be mmap'ed in
and used directly, thus saving the processing time in reconstituting
them and allowing for better shared resources across multiple search
threads/processes. My guess is that this could be a win over even the
BitVectors.
> However, BitVectors don't work well remotely, since they're too big
> to pass around -- that's why SearchClient doesn't support filtering.
This does strike me as a major disadvantage. Almost everything else
I've looked at in KinoSearch seems like it scales quite nicely to
multiple machines.
It would be nice to enable someone to use this for a new Google :).
Nathan Kurz
nate@verse.com