Mailing List Archive

Scaling KinoSearch
One of my long-term ambitions is to create a site integrating
user-ratings and full-text-search, sort of an integration of Google
and Reddit. I spent an evening thinking about how one might scale
KinoSearch to work with a medium sized site of this sort, and thought
I'd document my thoughts. For the purpose of this discussion, I'm
going to define 'medium sized search site' as one with a
non-stop-worded 1 TB index and capable of doing 10 searches per second
with a response time of better than 1 second. It also needs to be
able to scale to both larger indexes and to more requests per second.

My first conclusion was that it will be necessary to split the index
between multiple machines. Even if it were possible to put 1 TB of RAM
in a machine, at a modern processor's ~5-10 GB/s for main memory access
it would still take far too long to scan through the posting list for
a common word. Disk access, at around 100 MB/s, is almost too slow
to consider, although if the tail of infrequent words is long enough
it might be possible to save RAM by keeping the smallest posting lists
on disk.

The easiest way to split the index is to split it horizontally into
what Google calls 'shards': the first million documents go to the
first machine, the second million to the next machine, and so forth.
Each machine returns the best matches from its subset of documents,
and a central machine collates this list. Splitting the other way (by
words) might have some small performance benefit, but it also seems
much harder to implement. And the shard approach makes adding new
documents to the index very easy.

I'm guessing that for an expensive query ("and the"), we'll have to
go through about 10% of the index, or 100 GB. Allowing for incomplete
utilization of memory bandwidth, this would mean we would want 32 GB
of RAM per machine, so that each machine can scan its ~3 GB share of
postings in under a second. Even this might be an aggressive
presumption: I read an older Google architecture paper where they
said that due to index compression they were processor limited and
only getting 20% utilization here. But I think that processors have
grown relatively
faster than memory access since that paper.

If we can get through 3 GB/s, this means we need about 30 machines
with 32 GB each. Or 60 with 16 GB if it turns out that we can get
through only 1.5 GB/s, or only 15 with 64 GB if we can process 6 GB/s.
In actuality, though, I think this number self-corrects once we try to
account for the number of processors per machine: we just add cores
or processors until we are memory-bandwidth limited again. I presume
we'd be able to saturate the memory bus with two quad-core processors
(or fewer) and 64 GB.
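
For concreteness, here's the same envelope arithmetic as a tiny
script; every input is one of the guesses above, not a measurement:

#!/usr/bin/perl
use strict;
use warnings;

my $index_gb       = 1000;    # non-stop-worded 1 TB index
my $query_fraction = 0.10;    # an expensive query touches ~10% of postings
my $scan_gb_per_s  = 3;       # effective memory bandwidth per machine
my $latency_s      = 1;       # target worst-case response time

my $gb_to_scan = $index_gb * $query_fraction;                    # 100 GB total
my $machines   = $gb_to_scan / ( $scan_gb_per_s * $latency_s );  # ~33
printf "roughly %.0f machines, each holding about %.0f GB of index\n",
    $machines, $index_gb / $machines;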

To keep the latency down, this may require hosting multiple smaller
shards per machine.
Given multiple sockets and NUMA, one might even get a bigger win here
(doubling memory bandwidth) by deft use of processor affinity. I don't
understand this well, but here's an intro:
[www.kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf].

OK, so with 15 machines we hope to be able to serve one expensive
search per second. Does that mean we need 150 machines to be able to
average 10 searches/s? I hope not: most searches should be much
cheaper than that, probably even 10x cheaper. If we guess that 1/10
of the requests are expensive, we can probably do just fine with 2-3x
this number of machines. It's probably worth noting that these extra
machines don't necessarily need to have the full index --- with
slightly more complex request routing, one pool could have the full
index, and the other a much smaller stop-worded version.


Enough envelope scribblings for tonight. I'm not planning to buy 50
servers anytime soon, but if anyone sees obvious flaws in my reasoning
I'd be glad to hear them.

Nathan Kurz
nate@verse.com


ps. As an aside, splitting the index into multiple shards on a single
machine might be a good way to speed up in-RAM (or cached) searches on
a single-user multi-core system as well.

Re: Scaling KinoSearch
On May 17, 2008, at 11:19 PM, Nathan Kurz wrote:

> My first conclusion was that it will be necessary to split the index
> between multiple machines. Even if it were possible to put 1 TB of RAM
> in a machine, at a modern processor's ~5-10 GB/s for main memory access
> it would still take far too long to scan through the posting list for
> a common word.

I don't know if this would help with the particular app you have in
mind, but in certain cases there may be heuristics which can improve
search-time efficiency at the cost of some accuracy. For instance, if
you have an absolute measure by which you can pre-rank documents --
e.g. a "pagerank" score -- then you don't necessarily need to scan
through the entire posting list at search time. See the docs for
Schema::pre_sort and Searcher's set_prune_factor() method.
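
For instance, a minimal sketch (the value passed to set_prune_factor()
and its exact meaning are placeholders on my part; check the docs for
the real semantics):

# Minimal sketch of search-time pruning. The prune factor value is a
# placeholder; see the pre_sort/set_prune_factor docs for what the
# argument really means.
my $searcher = KinoSearch::Searcher->new(
    invindex => MySchema->read('/path/to/invindex'),
);
$searcher->set_prune_factor(10);
my $hits = $searcher->search( query => 'common words' );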

> Disk access, at around 100 MB/s, is almost too slow
> to consider, although if the tail of infrequent words is long enough
> it might be possible to save RAM by keeping the smallest posting lists
> on disk.

But since the virtual memory system is going to cache that stuff
anyhow, wouldn't you prefer to leave memory management to the OS?

> The easiest way to split the index is to split it horizontally into
> what Google calls 'shards': the first million documents go to the
> first machine, the second million to the next machine, and so forth.
> Each machine returns the best matches from its subset of documents,
> and a central machine collates this list. Splitting the other way (by
> words) might have some small performance benefit, but it also seems
> much harder to implement.

It's probably also best to divide up machines by tasks. The machines
which are responsible for scanning through posting lists should
probably not also be responsible for retrieving stored docs and
applying highlighting. The search machines should merely return
document identifiers to the boss node(s), which will then request docs
from the storage cluster.

The code for the worker nodes in the scoring cluster might look
something like this:

package MySegReader;
use base qw( KinoSearch::Index::SegReader );

sub make_doc_reader {
    my $self = shift;
    return PrimaryKeyOnlyDocReader->new(
        invindex => $self->get_invindex,
    );
}

package MyIndexReader;
use base qw( KinoSearch::Index::IndexReader );

sub make_seg_reader {
    my $self = shift;
    return MySegReader->new( invindex => $self->get_invindex );
}

...

my $reader = MyIndexReader->open(
    invindex => MySchema->read('/path/to/invindex'),
);
my $searcher      = KinoSearch::Searcher->new( reader => $reader );
my $search_server = KinoSearch::Search::SearchServer->new(
    searchable => $searcher,
    port       => 7890,
    password   => 'opensesame',
);
$search_server->serve;

The boss node code might look something like this:

package MyMultiSearcher;
use base qw( KinoSearch::Search::MultiSearcher );

our %doc_server;

sub new {
    my $either = shift;
    my $self   = $either->SUPER::new(@_);
    $doc_server{$$self} = DocServerCluster->new;
    return $self;
}

sub fetch_doc {
    my ( $self, $doc_num ) = @_;
    # The scoring nodes return only a primary key; use it to fetch the
    # full document from the storage cluster.
    my $key = $self->SUPER::fetch_doc($doc_num)->{primary_key};
    return $doc_server{$$self}->fetch_doc($key);
}

...

my $schema = MySchema->new;
my @searchers;
for my $server_name (@server_names) {
    push @searchers, KinoSearch::Search::SearchClient->new(
        peer_address => "$server_name:$port",
        password     => $pass,
        schema       => $schema,
    );
}
my $multi_searcher = MyMultiSearcher->new(    # the subclass defined above
    searchables => \@searchers,
    schema      => $schema,
);

...

It's possible to improve that (there's an extra round trip per doc
fetch), but you get the idea.

> And the shard approach makes adding new
> documents to the index very easy.

This is where I think KS might be able to use the index-time
equivalent of Searchable. (Indexable? Nah. IndexBuilder?).
InvIndexer would be the single-machine subclass. MultiInvIndexer
would be the analogue to MultiSearcher, farming out doc additions to
various worker nodes...
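
Purely to illustrate the analogy -- none of this exists yet, the
constructor arguments are invented, and add_doc() and finish() just
mirror the existing InvIndexer interface:

# Hypothetical sketch only: MultiInvIndexer doesn't exist, and the
# arguments are made up. Assumes one worker node per index shard.
my $indexer = MultiInvIndexer->new(
    nodes => \@indexing_nodes,
);
$indexer->add_doc( { title => $title, content => $content } );   # routed to a worker
$indexer->finish;                                                # commit on every worker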

Or maybe it's best to just let solutions arise on their own.
Distributed indexing is easier than distributed searching.

> To keep the latency down, this may require hosting multiple smaller
> shards per machine.

That could work. How about one shard per processor? I think that
might actually be a good way to exploit multiple processors -- better
than ithreads (which KS doesn't do anyhow).
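
A rough sketch of what that could look like, reusing the SearchServer
setup from above (the core count, shard paths, and ports are made up;
MySchema is the schema class from the earlier examples):

# Rough sketch: one shard per core, each served by its own SearchServer
# process listening on its own port.
use KinoSearch::Searcher;
use KinoSearch::Search::SearchServer;

my $num_cores = 8;
for my $i ( 0 .. $num_cores - 1 ) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    next if $pid;    # parent keeps launching children
    my $searcher = KinoSearch::Searcher->new(
        invindex => MySchema->read("/path/to/shard_$i"),
    );
    KinoSearch::Search::SearchServer->new(
        searchable => $searcher,
        port       => 7890 + $i,
        password   => 'opensesame',
    )->serve;
    exit;
}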

> Given multiple sockets and NUMA, one might even get a bigger win here
> (doubling memory bandwidth) by deft use of processor affinity. I don't
> understand this well, but here's an intro:
> [www.kernel.org/pub/linux/kernel/people/christoph/pmig/
> numamemory.pdf].

On first scan, it makes basic sense to me as another variation on the
theme of locality of reference.

> if anyone sees obvious flaws in my reasoning
> I'd be glad to hear them.

There are some Searchable APIs that don't work with multiple machines
as presently implemented.

First, QueryFilter objects can't be passed between nodes because
they're implemented as cached BitVectors, which are too big to send
over the network with each search. To address this, you'd have to
cook up special SearchServer subclasses that kept their own
QueryFilter objects around and knew how to interpret special requests
from the SearchClient.

Second, KinoSearch's present multi-machine sorting implementation
isn't very well-done IMO.

Last, Searchable's collect() API, soon to go public...

$searcher->collect( query => $query, collector => $hit_collector );

... is single-machine as well, because the HitCollector would have to
be serialized and sent to each node. Working around this limitation
has to be done case-by-case, as it has been with TopDocCollector and
Searchable's top_docs() method.

None of these constitutes a fundamental flaw in your model. The
HitCollector limitation is permanent and the other two are just
problems to be worked out. With luck we can solve 'em as elegantly as
we've solved QueryParser.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Scaling KinoSearch
On Mon, May 19, 2008 at 12:06 AM, Marvin Humphrey
<marvin@rectangular.com> wrote:
>
> On May 17, 2008, at 11:19 PM, Nathan Kurz wrote:
>> Disk access, at around 100 MB/s, is almost too slow
>> to consider, although if the tail of infrequent words is long enough
>> it might be possible to save RAM by keeping the smallest posting lists
>> on disk.
>
> But since the virtual memory system is going to cache that stuff anyhow,
> wouldn't you prefer to leave memory management to the OS?

While doing so would provide good average response time, it would have
a very poor worst-case time. If searches for a very common word are
rare, its posting list won't stay cached, and you might need to read
a couple of GB from disk per shard. Thus if one wants a short maximum
response time for expensive queries, one needs to keep the large stuff
pegged in memory with something like mlock(). Leaving the small stuff
to be buffered by the system might work, but I'm not sure the savings
are worth the complexity.

> It's probably also best to divide up machines by tasks. The machines which
> are responsible for scanning through posting lists should probably not also
> be responsible for retrieving stored docs and applying highlighting.

Yes, they definitely should at least be logically split. In practice,
it might work well to do these on the same machine, since scanning
should be memory-bandwidth limited and retrieving will be disk-I/O
limited.

> The
> search machines should merely return document identifiers to the boss
> node(s), which will then request docs from the storage cluster.

Yes, and I'd love to have a simple protocol (likely HTTP) that works
between these pieces.

> Or maybe it's best to just let solutions arise on their own. Distributed
> indexing is easier than distributed searching.

I think this can be thought about later. There are many solutions here
that could work.

>> To keep the latency down, this may require hosting multiple smaller
>> shards per machine.
>
> That could work. How about one shard per processor? I think that might
> actually be a good way to exploit multiple processors -- better than
> ithreads (which KS doesn't do anyhow).

It won't be more than one shard per processor, but depending on how
processor-efficient we can make things, I suspect it will be fewer. It's
going to be different for different setups. I'm fuzzy on the
details, but I think it's going to depend more on the number of
sockets than the number of cores. Probably best to assume it needs to
be configurable.

> First, QueryFilter objects can't be passed between nodes because they're
> implemented as cached BitVectors, which are too big to send over the network
> with each search. To address this, you'd have to cook up special
> SearchServer subclasses that kept their own QueryFilter objects around and
> knew how to interpret special requests from the SearchClient.

I think this is where my vision starts to differ from yours. The
concept of a BitVector implies index centralization, whereas I'd rather
have a lightweight Boss only loosely connected to each Searcher.
Anything that was previously done with a QueryFilter could instead be
done with an extra AndQuery clause against an unanalyzed field.
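
Roughly like this -- the field name is made up, constructor details
vary between KinoSearch versions, and $parsed_user_query and $searcher
are assumed to already exist:

# Rough sketch: instead of a cached QueryFilter, add the restriction as
# an extra required clause against an un-analyzed field.
my $filter_clause = KinoSearch::Search::TermQuery->new(
    term => KinoSearch::Index::Term->new( 'category', 'books' ),
);
my $query = KinoSearch::Search::BooleanQuery->new;
$query->add_clause( query => $parsed_user_query, occur => 'MUST' );
$query->add_clause( query => $filter_clause,     occur => 'MUST' );
my $hits = $searcher->search( query => $query );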

> Second, KinoSearch's present multi-machine sorting implementation isn't very
> well-done IMO.

Not a worry. In the short-run this is unlikely to be a problem, and
in the long run it can be optimized.

> Last, Searchable's collect() API, soon to go public...
>
> $searcher->collect( query => $query, collector => $hit_collector );
>
> ... is single-machine as well, because the HitCollector would have to be
> serialized and sent to each node.

I'm not following why HitCollector would need to be serialized, but
this might be a limitation of my procedural world view. I would want
things to be more loosely coupled than this. Rather than using the
same KinoSearch architecture for both the Boss and the Worker, I would
have them interact via a limited interface (probably via HTTP,
although possibly FCGI or the memcached protocol).

The Boss would receive a text query, and pass it on --- as text --- to
a pool of Searchers. The Searchers, using the current KinoSearch
implementation wrapped by a thin persistence layer, would perform the
search and return a list of Documents and Scores. The Boss would
select the highest results from the responses (skipping Workers that
take too long), and (if desired) make a second set of requests to
another pool to get highlighted excerpts.
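
The merge on the Boss side can stay dead simple. A sketch, assuming
each Searcher response has already been parsed into [key, score] pairs
(the wire format here is invented):

# Take the global top N across all worker responses. Each response is
# an arrayref of [ doc_key, score ] pairs.
sub merge_worker_results {
    my ( $num_wanted, @responses ) = @_;
    my @all = sort { $b->[1] <=> $a->[1] }    # highest score first
              map  { @$_ } @responses;        # flatten per-shard lists
    $#all = $num_wanted - 1 if @all > $num_wanted;
    return @all;
}

# e.g. the top 3 overall from two shards:
my @top = merge_worker_results( 3,
    [ [ 'doc-a', 4.2 ], [ 'doc-b', 1.1 ] ],
    [ [ 'doc-c', 3.7 ], [ 'doc-d', 0.9 ] ],
);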

The Boss would be completely ignorant of how the Searchers search ---
for all it knows, they could be running Lucene, making a SQL query, or
even be proxying to Google. The Searchers would be completely
independent of one another --- even document numbers would not need to
be presumed unique.

Are there disadvantages to this approach? I like its transparency,
and that it doesn't add complexity to the KinoSearch core.

Nathan Kurz
nate@verse.com

Re: Scaling KinoSearch
On Wed, May 21, 2008 at 3:35 PM, Nathan Kurz <nate@verse.com> wrote:
> Rather than using the
> same KinoSearch architecture for both the Boss and the Worker, I would
> have them interact via a limited interface (probably via HTTP,
> although possibly FCGI or the memcached protocol).
>
> The Boss would receive a text query, and pass it on --- as text --- to
> a pool of Searchers. The Searchers, using the current KinoSearch
> implementation wrapped by a thin persistence layer, would perform the
> search and return a list of Documents and Scores. The Boss would
> select the highest results from the responses (skipping Workers that
> take too long), and (if desired) make a second set of requests to
> another pool to get highlighted excerpts.
>
> The Boss would be completely ignorant of how the Searchers search ---
> for all it knows, they could be running Lucene, making a SQL query, or
> even be proxying to Google. The Searchers would be completely
> independent of one another --- even document numbers would not need to
> be presumed unique.
>
> Are there disadvantages to this approach? I like its transparency,
> and that it doesn't add complexity to the KinoSearch core.

I've thought more about this approach, and I think it is sound. It
should be pretty easy to implement using existing tools, and it should
scale very well. One nice thing is that it can work alongside the
internal OO search distribution that KinoSearch already does.

I haven't actually done it, but I think it could be implemented as a
plugin to Perlbal: <http://www.danga.com/perlbal/>. Perlbal is a
single-threaded reverse proxy, normally used to load balance across a
pool of backend servers. But it seems pretty straightforward to
modify it to make requests to all of the servers in a pool and
aggregate the responses. I asked about this on the Perlbal list, and
got a pleasantly positive response:
http://lists.danga.com/pipermail/perlbal/2008-May/000969.html

Perlbal would maintain a persistent connection to each of the shards,
whether by HTTP, FCGI, or something custom. These shards would be
grouped into pools, such that if one pool is busy another pool is
used. If all pools are busy, the request is held until a pool frees
up. If a request to a shard takes longer than some threshold time,
it is canceled and the responses from the on-time shards are used.

Perlbal has a good management interface, and in the future it
should be possible to design a layer on top of this that could respawn
shards as necessary. Or if using a cloud computing approach like
Amazon EC2, it should be possible to actively add and remove pools in
response to demand. I think this can be added at some later date,
though, and does not require immediate thought.

One possible trickiness is that Perlbal's single-threaded event-driven
approach doesn't leave much room for processing the responses. But
finding the top responses from a few dozen shards shouldn't be much of
a problem. I've thought some about efficient algorithms to do this,
but haven't tried any yet. And the upside of the single-threaded
approach is that connection caching can be very efficient.

I won't be able to actually get this done for a while, but that
shouldn't stop anyone else from doing it. One of the nice parts about
this approach is that very little needs to be done to the KinoSearch
core to make this possible --- other than a thin wrapper with the
appropriate interface, it's all already there. And while I haven't
thought about it as much, I think the Doc servers could use the
same approach for getting highlighted excerpts.

Nathan Kurz
nate@verse.com
