On Jun 11, 2007, at 2:04 PM, Hans Dieter Pearcey wrote:
> I looked through the MultiSearcher source and saw the dire warnings
> about "the sort cache problem", but I don't know what that means, and
> the missing support for filters doesn't seem to be described at all.
Technically, MultiSearcher can handle Filters... it's SearchClient
that can't. The problem is the overhead of passing the cached
BitVector in a Filter between client and server. There's no fix
pending, I'm afraid.
The "sort cache problem" is really a problem of collating search
results from disparate searchers. For this one, there is a fix
available in theory, but it's costly in terms of memory.
In KS, the sort cache is an array of 32-bit integers, one per
document, corresponding to the "term number" in the lexicon for the
sort field.
If this is our doc set...
my @docs = (
    { name => 'fido',     species => 'dog' },    # doc 0
    { name => 'spot',     species => 'dog' },    # doc 1
    { name => 'whiskers', species => 'cat' },    # doc 2
);
... then the lexicon for the field 'species' will look like this...
my @lexicon = (
    'cat',    # term 0
    'dog',    # term 1
);
... and this will be our sort cache:
my @sort_cache = ( 1, 1, 0 );
It's dirt cheap to sort docs by term because of the sort cache: just
look up the term numbers for the two documents and compare.
if ( $sort_cache[$doc_num_a] < $sort_cache[$doc_num_b] ) {
    ...
}
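To make the mechanics concrete, here's a self-contained sketch of
building a sort cache from the lexicon and using it to order doc
numbers. The variable names are illustrative only, not KS internals:

```perl
use strict;
use warnings;

my @docs = (
    { name => 'fido',     species => 'dog' },    # doc 0
    { name => 'spot',     species => 'dog' },    # doc 1
    { name => 'whiskers', species => 'cat' },    # doc 2
);
my @lexicon = ( 'cat', 'dog' );

# Map each term to its term number...
my %term_num;
$term_num{ $lexicon[$_] } = $_ for 0 .. $#lexicon;

# ... yielding one integer per doc: the sort cache.
my @sort_cache = map { $term_num{ $_->{species} } } @docs;
# @sort_cache is ( 1, 1, 0 )

# Ordering doc numbers by field is now pure integer comparison
# (doc number as tie-breaker, for determinism):
my @sorted_doc_nums
    = sort { $sort_cache[$a] <=> $sort_cache[$b] or $a <=> $b }
      0 .. $#docs;
# @sorted_doc_nums is ( 2, 0, 1 ) -- whiskers the cat sorts first
```

No string ever gets compared at sort time; the lexicon is consulted
once, up front, and every subsequent comparison is two array lookups
and an integer test.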
The problem is that the central machine doesn't have access to the
sort caches on the satellites, so when it comes time to select the
creme de la creme from 5 sets of satellite top 10 hits, the central
machine doesn't know how docs from different satellites compare.
The straightforward solution is to have the satellites send not just
doc numbers to the central machine, but doc numbers accompanied by
sort field values.
# now:
( 451, 20, 52 )
# fixed:
( [ 451, 'cat' ], [ 20, 'dog' ], [ 52, 'human' ] )
Then the central machine could make string comparisons of the sort
field values from different machines.
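A minimal sketch of that collation step, assuming each satellite now
returns [ doc_num, sort_field_value ] pairs (the satellite names and
data are made up for illustration):

```perl
use strict;
use warnings;

# Hits as returned by two hypothetical satellites.
my %satellite_hits = (
    sat_a => [ [ 451, 'cat' ],  [ 20, 'dog' ], [ 52, 'human' ] ],
    sat_b => [ [ 7,   'bird' ], [ 99, 'dog' ] ],
);

# Pool all hits, tagging each with its satellite so doc numbers
# from different machines don't collide...
my @pooled;
for my $sat ( sort keys %satellite_hits ) {
    push @pooled, [ $sat, @$_ ] for @{ $satellite_hits{$sat} };
}

# ... then collate on the string value of the sort field.
my @collated = sort { $a->[2] cmp $b->[2] } @pooled;
# 'bird' sorts first, 'human' last
```

The central machine never needs the satellites' sort caches at all;
the string values carry enough information on their own.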
Looking up those values is kind of expensive, though. Lexicons are
only partially kept in memory (1 out of every 128 terms), and with a
multi-segment index, you have to perform 1 disk seek per segment.
Say you have 10 hits and 25 segments. That's 250 disk seeks to
associate each hit with a sort field value. :(
To avoid that cost, we might have to load entire lexicons for sort
fields into memory. I've been trying to avoid that, but I don't see
how.
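What that would look like, roughly: slurp the sort field's full
lexicon into a RAM array at index-open time, so that resolving a term
number to its text is an O(1) array lookup instead of a disk seek.
(read_lexicon below is a stand-in, not a KS function.)

```perl
use strict;
use warnings;

# Hypothetical reader: in real life this would scan the lexicon
# file for the given field; here we just fake the result.
sub read_lexicon {
    my ($field) = @_;
    return ( 'bird', 'cat', 'dog', 'human' );
}

my @full_lexicon = read_lexicon('species');

# Resolving the sort field value for a hit is now cheap:
my $term_num = 2;                            # from the sort cache
my $sort_val = $full_lexicon[$term_num];     # no disk seek
```

The price is memory proportional to the number of unique terms in the
sort field, which is exactly the cost I've been trying to avoid.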
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/