
MultiSearcher's lack of features
I'd really like to use MultiSearcher; as the docs note, it's nice to spread
load over several different machines. Right now, though, the lack of filtering
and sorting kills it for me.

Is it documented anywhere what would need to be done in order to add those
features? I looked through the MultiSearcher source and saw the dire warnings
about "the sort cache problem", but I don't know what that means and the
missing support for filters doesn't seem to be described at all.

hdp.
MultiSearcher's lack of features [ In reply to ]
On Jun 11, 2007, at 2:04 PM, Hans Dieter Pearcey wrote:

> I looked through the MultiSearcher source and saw the dire warnings
> about "the sort cache problem", but I don't know what that means
> and the
> missing support for filters doesn't seem to be described at all.

Technically, MultiSearcher can handle Filters... it's SearchClient
that can't. The problem is the overhead of passing the cached
BitVector in a Filter between client and server. There's no fix
pending, I'm afraid.
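
To get a feel for that overhead, here's a back-of-envelope sketch (the
index size is an illustrative assumption, not a measurement):

use strict;
use warnings;

# A cached filter is a BitVector with one bit per document, so the
# wire cost of shipping it per query grows with the index.
my $num_docs  = 10_000_000;                  # assumed index size
my $megabytes = ( $num_docs / 8 ) / 2**20;   # bits -> bytes -> MB
printf "~%.2f MB of BitVector per filtered query\n", $megabytes;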

The "sort cache problem" is really a problem of collating search
results from disparate searchers. For this one, there is a fix
available in theory, but it's costly in terms of memory.

In KS, the sort cache is an array of 32-bit integers, one per
document, corresponding to the "term number" in the lexicon for the
sort field.

If this is our doc set...

my @docs = (
    { name => 'fido',     species => 'dog' },    # doc 0
    { name => 'spot',     species => 'dog' },    # doc 1
    { name => 'whiskers', species => 'cat' },    # doc 2
);

... then the lexicon for the field 'species' will look like this...

my @lexicon = (
    'cat',    # term 0
    'dog',    # term 1
);

... and this will be our sort cache:

my @sort_cache = ( 1, 1, 0 );

It's dirt cheap to sort docs by term because of the sort cache: just
look up the term numbers for the two documents and compare.

if ( $sort_cache[$doc_num_a] < $sort_cache[$doc_num_b] ) {
    ...
}
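
Here's a toy, self-contained version of that scheme in plain Perl (not
actual KS internals):

use strict;
use warnings;

# Term numbers, one per doc, as in the example above.
my @sort_cache = ( 1, 1, 0 );

# Order all doc numbers by term number -- integer compares only, no
# string comparisons and no disk access.
my @sorted_doc_nums = sort { $sort_cache[$a] <=> $sort_cache[$b] }
    0 .. $#sort_cache;
print "@sorted_doc_nums\n";    # 2 0 1 (docs 0 and 1 tie)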

The problem is that the central machine doesn't have access to the
sort caches on the satellites, so when it comes time to select the
creme de la creme from 5 sets of satellite top 10 hits, the central
machine doesn't know how docs from different satellites compare.

The straightforward solution is to have the satellites send not just
doc numbers to the central machine, but doc numbers accompanied by
sort field values.

# now:
( 451, 20, 52 )
# fixed:
( [ 451, 'cat' ], [ 20, 'dog' ], [ 52, 'human' ] )

Then the central machine could make string comparisons of the sort
field values from different machines.
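
A minimal sketch of that collation step (hypothetical data layout; the
variable names aren't from KS):

use strict;
use warnings;

# Top hits from each satellite as [ doc_num, sort_value ] pairs, per
# the "fixed" example above.
my @satellite_results = (
    [ [ 451, 'cat' ], [ 20, 'dog' ] ],    # satellite 1
    [ [ 52, 'human' ] ],                  # satellite 2
);

# Collate across satellites by comparing sort field values as strings.
my @merged = sort { $a->[1] cmp $b->[1] } map { @$_ } @satellite_results;
my @top_hits = @merged > 10 ? @merged[ 0 .. 9 ] : @merged;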

Looking up those values is kind of expensive though. Lexicons are
only partially kept in memory (1 out of every 128 terms), and with a
multi-segment index, you have to perform 1 disk scan per segment.
Say you have 10 hits and 25 segments. That's 250 disk seeks to
associate each hit with a sort field value. :(

To avoid that cost, we might have to load entire lexicons for sort
fields into memory. I've been trying to avoid that, but I don't see
how.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
MultiSearcher's lack of features [ In reply to ]
Oi - multi-search without sorting is an app killer in my case too...

> # now:
> ( 451, 20, 52 )
> # fixed:
> ( [ 451, 'cat' ], [ 20, 'dog' ], [ 52, 'human' ] )
>
> Then the central machine could make string comparisons of the sort
> field values from different machines.
>
> Looking up those values is kind of expensive though. Lexicons are
> only partially kept in memory (1 out of every 128 terms), and with a
> multi-segment index, you have to perform 1 disk scan per segment.
> Say you have 10 hits and 25 segments. That's 250 disk seeks to
> associate each hit with a sort field value. :(

Very bad indeed. That would potentially murder search times on a busy
search cluster, right? (I/O being the bottleneck)

> To avoid that cost, we might have to load entire lexicons for sort
> fields into memory. I've been trying to avoid that, but I don't see
> how.

Just so I understand: when you say "load entire lexicons for sort fields
into memory" you mean the sort fields of the -search result set-, right?

I'm trying to get my mind around what kind of memory and performance
requirements this would imply for a given corpus size, number of search
nodes, and search volume, for planning purposes.

Regards
Henry
MultiSearcher's lack of features [ In reply to ]
On Jun 12, 2007, at 2:28 AM, Henka wrote:

>> Looking up those values is kind of expensive though. Lexicons are
>> only partially kept in memory (1 out of every 128 terms), and with a
>> multi-segment index, you have to perform 1 disk scan per segment.
>> Say you have 10 hits and 25 segments. That's 250 disk seeks to
>> associate each hit with a sort field value. :(
>
> Very bad indeed. That would potentially murder search times on a busy
> search cluster, right? (I/O being the bottleneck)

I think I've overstated the problem.

First... if we go to full caching of sort fields, it's a memory hit,
but not an unmanageable one. The full caching strategy is, in fact,
EXACTLY how Lucene does things. I'd hoped to improve on it, but
maybe that won't be possible this time.

Second... best practice for a busy search cluster would be to
optimize the indexes on most nodes, so that each index contains a
single segment. Then you're looking at 1 disk seek per hit, and
they're all coming at the same time and are all concentrated in the
same spot in the same file. OS caching will help some, and the fact
that the disk head wouldn't have to travel far will also make those
seeks comparatively inexpensive. Search performance will continue to
be dominated by the time it takes to score large numbers of
matches for common terms. These lookups won't be a primary
consideration.
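
For what it's worth, the optimize step itself is tiny. This sketch
assumes the InvIndexer API of the current KinoSearch releases, with
placeholder path and analyzer -- check the docs for your version:

use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
my $invindexer = KinoSearch::InvIndexer->new(
    invindex => '/path/to/invindex',    # placeholder path
    analyzer => $analyzer,
);

# Merge all segments into one, so sort-value lookups cost 1 seek per hit.
$invindexer->finish( optimize => 1 );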

For busy search clusters that must keep indexes updated frequently,
you should dedicate one machine to rapidly changing data, while all
the rest handle older, stable data and stay optimized. The rapid-
update index will necessarily be multi-segment, but if you keep it
small, the costs should be manageable.

I intend to write KinoSearch::Docs::CookBook::ScalingUp describing
this architecture. (: But not this week. :) Best practice will
remain the same no matter what system we adopt to handle the sort
caching issue.

>> To avoid that cost, we might have to load entire lexicons for sort
>> fields into memory. I've been trying to avoid that, but I don't see
>> how.
>
> Just so I understand: when you say "load entire lexicons for sort
> fields
> into memory" you mean the sort fields of the -search result set-,
> right?

Say you are sorting by 'date'.

At present, we keep 1 out of every 128 values for the date field in
memory -- the contents of the .lexx file (LEXicon indeX). When we
need to find a particular value, we look it up in this index, which
tells us the general location on disk. Then we scan a small portion of
the full .lex file to find the exact term.

What we might do instead is load ALL 'date' values into memory (the
full .lex file) -- then we wouldn't need to touch the disk again.
The memory costs of doing this depend on how many unique dates you have.
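
As a toy model of the difference (plain Perl; the real .lex format is
binary, not line-based):

use strict;
use warnings;

# Now: keep 1 of every 128 terms in RAM (.lexx) and seek + scan the
# on-disk .lex file for everything in between.
# Proposed: slurp every term into RAM once; after that, mapping a
# sort-cache entry to its field value is a plain array index.
open my $fh, '<', 'seg.lex' or die "open: $!";    # hypothetical file
chomp( my @full_lexicon = <$fh> );                # one term per line (toy)
close $fh;

my @sort_cache = ( 1, 1, 0 );    # term number per doc, as above
my $doc_num    = 2;
my $date       = $full_lexicon[ $sort_cache[$doc_num] ];    # no disk access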

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
MultiSearcher's lack of features [ In reply to ]
> First... if we go to full caching of sort fields, it's a memory hit,
> but not an unmanageable one. The full caching strategy is, in fact,
> EXACTLY how Lucene does things. I'd hoped to improve on it, but
> maybe that won't be possible this time.

Understood. I think you've done a bloody fine job anyway!

> Second... best practice for a busy search cluster would be to
> optimize the indexes on most nodes, so that each index contains a
> single segment. Then you're looking at 1 disk seek per hit, and
> they're all coming at the same time and are all concentrated in the
> same spot in the same file. OS caching will help some, and the fact
> that the disk head wouldn't have to travel far will also make those
> seeks comparatively inexpensive. Search performance will continue to
> be dominated by the time it takes to score large numbers of
> matches for common terms. These lookups won't be a primary
> consideration.

OK - I was intending to always optimize the final indexes on each cluster
node.

> For busy search clusters that must keep indexes updated frequently,
> you should dedicate one machine to rapidly changing data, while all
> the rest handle older, stable data and stay optimized. The rapid-
> update index will necessarily be multi-segment, but if you keep it
> small, the costs should be manageable.

For the time being, I was thinking of a nice simple approach: the search
node indexes are always optimized and are 'refreshed' (i.e., overwritten,
quickly, during the graveyard shift) whenever the indexing cycle dictates
(once a week, month, whatever), with indexing and merging occurring
outside the search nodes on indexing cluster nodes.

However, I like your idea of having (a) separate search node(s) with data
which is in flux and un-optimized... hell, this is turning out to be
oodles of fun! <rubbing hands>

>>> To avoid that cost, we might have to load entire lexicons for sort
>>> fields into memory. I've been trying to avoid that, but I don't see
>>> how.
>>
>> Just so I understand: when you say "load entire lexicons for sort
>> fields
>> into memory" you mean the sort fields of the -search result set-,
>> right?
>
> Say you are sorting by 'date'.
>
> At present, we keep 1 out of every 128 values for the date field in
> memory -- the contents of the .lexx file (LEXicon indeX). When we
> need to find a particular value, we look it up in this index, which
> tells us the general location on disk. Then we scan a small portion of
> the full .lex file to find the exact term.
>
> What we might do instead is load ALL 'date' values into memory (the
> full .lex file) -- then we wouldn't need to touch the disk again.
> The memory costs of doing this depend on how many unique dates you have.

So far, about 30 million. Even at quadruple that, sorting by float and
spread across multiple beefy nodes, things should be fine. This means my
initial knee-jerk reaction was unfounded.
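
To sanity-check that with a back-of-envelope (illustrative assumptions:
~10-byte terms plus a few bytes of bookkeeping each; the real in-memory
layout may differ):

use strict;
use warnings;

my $unique_terms   = 30_000_000;
my $bytes_per_term = 10 + 8;    # assumed string data + bookkeeping
printf "~%.1f GB for one fully cached sort field\n",
    $unique_terms * $bytes_per_term / 2**30;    # ~0.5 GB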

Thanks for the detailed explanation.
Henry