On Aug 28, 2006, at 4:04 PM, Simon Wistow wrote:
> We're currently looking at about 10 Tb of data
FYI, I don't know of anyone who's built a KS-based search engine on
that scale. I'd really like to see somebody make the attempt, but I
can't say what will happen. ;)
If the search-time performance turns out to be too slow, eventually
KS will have a cluster system available. Multiple machines would
need to communicate with each other prior to the main search
processing: each term needs to get an IDF in order for scoring to
proceed, and you have to know the total size of the collection and
the number of docs the term appears in to calculate it. If a search
requested 10 docs, each machine would produce 10, then the top 10
scorers out of the aggregate group would be your final search
results. If you want to see how this can work, take a look at
Lucene's MultiSearcher class.
Working up such a cluster system is not currently at the top of my to-
do list. (Finding steady work is, followed by FieldCache, Lucy, etc...)
> to index with approx 16 new updates a second which we'd like to merge
> within about 40 seconds so any help would be gratefully appreciated.
The only way you're going to be able to manage this is if a new
option gets added to finish(), and even then it will be challenging.
This would also hold true for Lucene, because the problem is the
same: inverted index data structures are optimized for fast
searching, not instant updates.
Merging of segments takes an unpredictable amount of time. Usually
it will be pretty quick, but every once in a while, when large
segments get merged, it will take a while.
Furthermore, when you update an index, you need to open a new
Searcher and warm up its caches. That cost is much easier to
predict, though, as the warm-up time tracks the size of the index
closely.
Incremental indexing in KS/Lucene works something like this:
Add a book, generate an index.
Add another chapter, generate a second index. Now you need to
leaf through two indexes.
Add another chapter, generate a third index. Now you've got
three indexes to look through each time you search.
...
Those indexes get merged periodically, and the interleaving process
takes time proportional to the size of the segments being merged.
The more segments you have, the longer it takes to search, though the
time is still much more dependent on absolute size of index than on
number of segments. Optimizing the index via...
$invindexer->finish( optimize => 1 );
... reads all existing segments and dumps them into the one currently
being written, which is costly at index-time but produces a single-
segment index for fastest searching. Ordinarily, KS tries to keep
segments around in something approximating a Fibonacci series (as an
aside, this is different from Lucene) -- the idea being to minimize
the number of merges, while still providing good search-time
performance.
What we would need to do is STOP KinoSearch from merging any existing
segments into the segment currently in progress (that's the API
that's missing). Then every once in a while, you'd force-optimize
the index -- probably off-line to a copy which gets swapped in when
the process completes. It's a reasonable strategy, but the number of
updates in your spec is quite substantial.
Cheers,
Marvin Humphrey
--
I'm looking for a part time job.