Mailing List Archive

filtering and tallying
On 7/2/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> do {
> ...
> } while (Scorer_Skip_To( self, BitVec_Next_Set_Bit(filter_bits) ));

I was thinking more along the lines of something that would fit into
the search infrastructure. One could create a FilterScorer that would
wrap an existing scorer. Something like (obviously not exact):

$boolean_scorer = $query->scorer;
$filter_scorer = new FilterScorer($boolean_scorer, @filters);
$filter_scorer->collect();

While advancing through documents, FilterScorer works like an
AndScorer, but for scoring it only calls Tally on the internal
BooleanScorer. If one wanted, the actual filters could be
BitVectorScorers that would use a cached BitVector rather than reading
in Postings.

> The downside is that it takes the filtering logic out of HitCollector
> and puts it in Scorer. HitCollector is sparse; Scorer has a lot more
> going on.
And elsewhere you wrote:
> Analogously, it should be possible to create a suite of Queries/
> Scorers which are optimized for matching alone.

This got me thinking: how would a Scorer optimized for matching alone
be different? The main difference is that it wouldn't have any need for
Tally to be called. I think the only problem with using the existing
Scorers for matching is that they do a lot of unnecessary Tallying if
the score is thrown away.

So maybe rather than having Tally called as part of the
$scorer->collect() loop, it could be optionally called by the
HitCollector. A TopDocCollector would call Tally and use the score to
order hits, but a SortCollector wouldn't have to. HitCollector would
be passed a Scorer instead of just a score.

This was a late night idea, but it still seems to make sense to me this morning.
Does it make sense to you?

> PostingList objects are iterators that refill from disk; BitVectors
> are memory only. Fewer objects refilling from disk means fewer disk
> seeks means better performance.

My intuition is that BitVectors are going to be a win only in very
specific situations. If you can fit your entire index in RAM, the
system page cache is going to treat you well after the first access,
no BitVector needed. If your index is _much_ larger than RAM, the
BitVector is going to be unwieldy too. If your index is a
size comparable to RAM, even at a 1:32 ratio your (multiple) cached
BitVectors are going to be displacing index that would otherwise have
been cached.

I wonder if one would get more bang by optimizing the index structure
(as you are already doing) than by resorting to BitVectors. My
personal impulse is to figure out how the indexes could be mmap'ed in
and used directly, thus saving the processing time in reconstituting
them and allowing for better shared resources across multiple search
threads/processes. My guess is that this could be a win over even the
BitVectors.

> However, BitVectors don't work well remotely, since they're too big
> to pass around -- that's why SearchClient doesn't support filtering.

This does strike me as a major disadvantage. Almost everything else
I've looked at in KinoSearch seems like it scales quite nicely to
multiple machines.
It would be nice to enable someone to use this for a new Google :).

Nathan Kurz
nate@verse.com
filtering and tallying
On Jul 3, 2007, at 2:24 PM, Nathan Kurz wrote:

> On 7/2/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>> do {
>> ...
>> } while (Scorer_Skip_To( self, BitVec_Next_Set_Bit(filter_bits) ));
>
> I was thinking more along the lines of something that would fit into
> the search infrastructure. One could create a FilterScorer that would
> wrap an existing scorer. Something like (obviously not exact):
>
> $boolean_scorer = $query->scorer;
> $filter_scorer = new FilterScorer($boolean_scorer, @filters);
> $filter_scorer->collect();
>
> While advancing through documents, FilterScorer works like an
> AndScorer, but for scoring it only calls Tally on the internal
> BooleanScorer. If one wanted, the actual filters could be
> BitVectorScorers that would use a cached BitVector rather than reading
> in Postings.

Sounds like an idea with potential. I'm not sure you should expect
huge gains over the existing Filter implementation, but I'd be happy
to be surprised.

> how would a Scorer optimized for matching alone
> be different? The main difference is that it wouldn't have any need for
> Tally to be called.

There's that. Another big gain would come from a custom Posting type.

MatchPosting: <doc_delta>+
ScorePosting: <doc_delta, freq, shared_boost, <position>+ >+
RichPosting: <doc_delta, freq, <position, boost>+ >+

MatchPosting files have less data in them than ScorePosting files.
(: Or they would if they were actually done. :)

ScorePosting files have less data in them than RichPosting files.

Anyone who wants a brief definition of posting, see...

http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/IRTheory.html

See these pages for the initial brainstorming sessions:

http://wiki.apache.org/lucene-java/FlexibleIndexing
http://wiki.apache.org/lucene-java/ConversationsBetweenDougMarvinAndGrant

The implementation is slightly different in KS than was envisioned
for Lucene. In Lucene, posting information for different fields
resides in shared files. In KS, postings files are broken up by
field -- so there's only one Posting format per file. Also, there's
no Posting class in Lucene, and the Lucene version of PostingList is
TermDocs, which isn't quite the same. In Lucene, you subclass
TermDocs to get more data per document. In KS, you subclass Posting,
but you still use a PostingList.

Eventually, the idea is that people like yourself will create custom
Posting subclasses to supply custom Scorer subclasses with exactly
the data they need and no more.

> I think the only problem with using the existing
> Scorers for matching is that they do a lot of unnecessary Tallying if
> the score is thrown away.
>
> So maybe rather than having Tally called as part of the
> $scorer->collect() loop, it could be optionally called by the
> HitCollector. A TopDocCollector would call Tally and use the score to
> order hits, but a SortCollector wouldn't have to.

Sure.

> HitCollector would
> be passed a Scorer instead of just a score.

FWIW, Tally doesn't exist in Lucene, which only uses scorer.doc() and
scorer.score(). Tally is supposed to be the vessel which carries
"your score and more!". :)
>
> This was a late night idea, but it still seems to make sense to me
> this morning.
> Does it make sense to you?
>
>> PostingList objects are iterators that refill from disk; BitVectors
>> are memory only. Fewer objects refilling from disk means fewer disk
>> seeks means better performance.
>
> My intuition is that BitVectors are going to be a win only in very
> specific situations. If you can fit your entire index in RAM, the
> system page cache is going to treat you well after the first access,
> no BitVector needed. If your index is _much_ larger than RAM, the
> BitVector is going to be unwieldy too. If your index is a
> size comparable to RAM, even at a 1:32 ratio your (multiple) cached
> BitVectors are going to be displacing index that would otherwise have
> been cached.
>
> I wonder if one would get more bang by optimizing the index structure
> (as you are already doing) than by resorting to BitVectors. My
> personal impulse is to figure out how the indexes could be mmap'ed in
> and used directly, thus saving the processing time in reconstituting
> them and allowing for better shared resources across multiple search
> threads/processes. My guess is that this could be a win over even the
> BitVectors.

Possible. We'll just need to benchmark it.

>> However, BitVectors don't work well remotely, since they're too big
>> to pass around -- that's why SearchClient doesn't support filtering.
>
> This does strike me as a major disadvantage. Almost everything else
> I've looked at in KinoSearch seems like it scales quite nicely to
> multiple machines.
> It would be nice to enable someone to use this for a new Google :).

Yes. Google's search results are the sum of many small indexes,
and in another sense, the sum of many scoring algorithms. The KS
infrastructure should scale up so that at least large specialized
search engines are possible. For the scoring algorithms, we'll have
to work together as a community. :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/