One of my long-term ambitions is to create a site integrating
user ratings and full-text search, sort of a cross between Google
and Reddit. I spent an evening thinking about how one might scale
KinoSearch to work with a medium-sized site of this sort, and thought
I'd document my thoughts. For the purpose of this discussion, I'm
going to define a 'medium-sized search site' as one with a
non-stop-worded 1 TB index and capable of doing 10 searches per second
with a response time of better than 1 second. It also needs to be
able to scale to both larger indexes and to more requests per second.
My first conclusion was that it will be necessary to split the index
between multiple machines. Even if it were possible to put 1 TB of RAM
in a machine, at a modern processor's ~5-10 GB/s of main memory bandwidth
it would still take far too long to scan through the posting list for
a common word. Disk access, at around 100 MB/s, is almost too slow
to consider, although if the tail of infrequent words is long enough
it might be possible to save RAM by keeping the smallest posting lists
on disk.
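Some quick numbers to back that up (the ~100 GB posting-list size is the
same 10%-of-index guess I use below, not a measurement):

    # Back-of-envelope: time for one box to scan the posting list of a
    # very common word.  100 GB is a guess (~10% of a 1 TB index); the
    # bandwidths are the ones mentioned above.
    POSTING_LIST_BYTES = 100e9

    for medium, bytes_per_sec in [("RAM (optimistic)", 10e9),
                                  ("RAM (pessimistic)", 5e9),
                                  ("single disk", 100e6)]:
        print(medium, POSTING_LIST_BYTES / bytes_per_sec, "s")

    # -> roughly 10 s, 20 s, and 1000 s respectively: all far beyond a
    #    1-second response target, hence the need to split across machines.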
The easiest way to split the index is to split it horizontally into
what Google calls 'shards': the first million documents go to the
first machine, the second million to the next machine, and so forth.
Each machine returns the best matches from its subset of documents,
and a central machine collates this list. Splitting the other way (by
words) might have some small performance benefit, but it also seems
much harder to implement. And the shard approach makes adding new
documents to the index very easy.
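In code, the fan-out-and-collate step might look something like this
minimal sketch, where search_shard() is a made-up stand-in for an RPC to
one shard's searcher (not anything KinoSearch provides):

    import heapq
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical per-shard call: returns a list of (score, doc_id)
    # pairs for the top `k` matches on one shard.  Stands in for an RPC
    # to a searcher holding one horizontal slice of the documents.
    def search_shard(shard_host, query, k):
        raise NotImplementedError("wire this up to the real shard servers")

    def federated_search(shard_hosts, query, k=10):
        """Fan the query out to every shard, then collate the best k."""
        with ThreadPoolExecutor(max_workers=len(shard_hosts)) as pool:
            per_shard = pool.map(lambda host: search_shard(host, query, k),
                                 shard_hosts)
            hits = [hit for shard_hits in per_shard for hit in shard_hits]
        # Each shard already returned its own top k, so the global top k
        # is guaranteed to be among `hits`.
        return heapq.nlargest(k, hits)

Growing the index then just means adding another entry to shard_hosts.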
I'm guessing that for an expensive query ("and the"), we'll have to
go through about 10% of the index, or 100 GB. Allowing for incomplete
utilization of memory bandwidth, this would mean we would want 32 GB
of RAM per machine, so that each machine can scan its ~3 GB share (10%
of its 32 GB) in under a second. Even this might be an aggressive
assumption: I read an older Google architecture paper that said that,
due to index compression, they were processor-limited and getting only
20% utilization here. But I think processors have grown relatively
faster than memory access since that paper was written.
If we can get through 3 GB/s, this means we need about 30 machines
with 32 GB each; or 60 machines with 16 GB if we can only get through
1.5 GB/s; or just 15 machines with 64 GB if we can process 6 GB/s. In
actuality, though, I think this number self-corrects once we try to
account for the number of processors per machine: we just add cores
or processors until we are memory-bandwidth-limited again. I presume
we'd be able to saturate the memory bus with two quad-core processors
(or fewer) and 64 GB.
To keep the latency down, this may require hosting multiple smaller
shards per machine.
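Here's the sizing arithmetic from the last two paragraphs as a throwaway
script; the 10% scan fraction and the effective scan rates are guesses,
and rounding RAM to 16/32/64 GB gives the machine counts I quoted:

    # Throwaway version of the sizing arithmetic above.
    INDEX_GB = 1000        # 1 TB non-stop-worded index
    SCAN_FRACTION = 0.10   # worst-case query touches ~10% of what's in RAM
    TARGET_S = 1.0         # response-time budget

    for scan_rate_gb_s in (1.5, 3.0, 6.0):
        ram_per_box = scan_rate_gb_s * TARGET_S / SCAN_FRACTION
        boxes = INDEX_GB / ram_per_box
        print(scan_rate_gb_s, "GB/s ->", ram_per_box, "GB RAM/box, ~",
              round(boxes), "boxes")

    # 1.5 GB/s -> 15 GB RAM/box, ~67 boxes   (call it 60 boxes of 16 GB)
    # 3.0 GB/s -> 30 GB RAM/box, ~33 boxes   (call it 30 boxes of 32 GB)
    # 6.0 GB/s -> 60 GB RAM/box, ~17 boxes   (call it 15 boxes of 64 GB)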
Given multiple sockets and NUMA, one might even get a bigger win here
(doubling memory bandwidth) by deft use of processor affinity. I don't
understand this well, but here's an intro:
[www.kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf].
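I haven't tried this, but to make the idea concrete, pinning one
shard-scanning worker per socket might look roughly like the following
(Linux-only; the CPU numbering for an imagined two-socket box is made up,
and binding the memory itself would need numactl or libnuma, not shown):

    import os

    # One worker per NUMA node, reading posting lists that ideally live
    # in that node's local memory.
    NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}   # made-up topology

    def pin_to_node(node):
        # 0 means "the calling process"
        os.sched_setaffinity(0, NODE_CPUS[node])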
OK, so with 15 machines we hope to be able to serve one expensive
search per second. Does that mean we need 150 machines to be able to
average 10 searches/s? I hope not: most searches should be much
cheaper than that, probably even 10x cheaper. If we guess that 1/10
of the requests are expensive, we can probably do just fine with 2-3x
this number of machines. It's probably worth noting that these extra
machines don't necessarily need to hold the full index --- with a
little more complex request routing, one pool could have the full
index and the other a much smaller stop-worded version.
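As a sketch of what that routing might look like (the stopword list and
the pool handles are placeholders, not anything KinoSearch provides):

    # Queries containing a common (stopped) word need the pool that holds
    # the full, non-stop-worded index; everything else can be answered by
    # the smaller stop-worded pool.
    STOPWORDS = {"a", "an", "and", "of", "the"}   # truncated, hypothetical

    def choose_pool(query_terms, full_pool, stopworded_pool):
        if any(term.lower() in STOPWORDS for term in query_terms):
            return full_pool          # expensive: needs every posting list
        return stopworded_pool        # cheap: the smaller index suffices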
Enough envelope scribblings for tonight. I'm not planning to buy 50
servers anytime soon, but if anyone sees obvious flaws in my reasoning
I'd be glad to hear them.
Nathan Kurz
nate@verse.com
ps. As an aside, splitting the index into multiple shards on a single
machine might be a good way to speed up in-RAM (or cached) searches on
a single-user multi-core system as well.
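A minimal sketch of that, with search_local_shard() as a placeholder for
opening and searching one sub-index, and worker processes rather than
threads so each core scans its own sub-index independently:

    import heapq
    from concurrent.futures import ProcessPoolExecutor

    def search_local_shard(shard_path, query, k):
        raise NotImplementedError("open and search the sub-index at shard_path")

    def local_search(shard_paths, query, k=10):
        # Same fan-out-and-collate pattern as above, but within one box.
        with ProcessPoolExecutor() as pool:
            results = pool.map(search_local_shard, shard_paths,
                               [query] * len(shard_paths),
                               [k] * len(shard_paths))
            hits = [hit for shard_hits in results for hit in shard_hits]
        return heapq.nlargest(k, hits)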
_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch