Mailing List Archive

Distributed indexes
Has anybody done any work on massive, distributed indexes and merging
between them?

Or would the sensible option be to separate the documents into buckets
and then merge the results from each of those?

Simon
Distributed indexes [ In reply to ]
> Has anybody done any work on massive, distributed indexes and merging
> between them?
>
> Or would the sensible option be to separate the documents into buckets
> and then merge the results from each of those?

Yes, we're busy with exactly that. Running a single indexer is not
feasible when you're talking about several terabytes of raw data which
needs to be re-indexed periodically. The master index is locked during
updates, so if you need to run many parallel indexers (eg if you want to
take advantage of cluster setups for performance, with a cloud of
indexers migrating to wherever there are resources), you need to write to
temp indexes and then merge them into the master with add_invindexes()
(and of course, first delete the existing docs in the master index using
delete_docs_by_term()).
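
Roughly, the merge step looks something like this -- a sketch from
memory, so check the KinoSearch docs for the exact calling conventions;
the paths, the 'doc_id' field and the sample lists are placeholders for
our real setup:

    use KinoSearch::InvIndexer;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

    my @temp_invindex_paths
        = ( '/tmp/indexer_1/invindex', '/tmp/indexer_2/invindex' );
    my @updated_doc_ids = ( 'doc-0001', 'doc-0002' );

    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/path/to/master_invindex',
        analyzer => $analyzer,
    );

    # first purge the docs that the temp indexes are about to re-add
    $invindexer->delete_docs_by_term( 'doc_id', $_ ) for @updated_doc_ids;

    # then fold the temp indexes written by the parallel indexers into
    # the master
    $invindexer->add_invindexes(@temp_invindex_paths);

    $invindexer->finish;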

Marvin has been incredibly helpful with answering some of our more dumb
questions in this regard. Things just get a little uncertain (design- and
performance-wise) when you start crunching really large data sets.

Regards
Distributed indexes [ In reply to ]
On Thu, Aug 24, 2006 at 09:25:36AM +0200, henka@cityweb.co.za said:
> Marvin has been incredibly helpful with answering some of our more dumb
> questions in this regard. Things just get a little uncertain (design- and
> performance-wise) when you start crunching really large data sets.

That's useful to know. I'd be interested in talking to you about any
problems and workarounds you've encountered, and about tips and tricks
for doing exactly the same. We're currently looking at about 10 TB of data
to index with approx 16 new updates a second, which we'd like to merge
within about 40 seconds, so any help would be greatly appreciated.

Simon
Distributed indexes [ In reply to ]
On Aug 28, 2006, at 4:04 PM, Simon Wistow wrote:

> We're currently looking at about 10 TB of data

FYI, I don't know of anyone who's built a KS-based search engine on
that scale. I'd really like to see somebody make the attempt, but I
can't say what will happen. ;)

If the search-time performance turns out to be too slow, eventually
KS will have a cluster system available. Multiple machines would
need to communicate with each other prior to the main search
processing: each term needs to get an IDF in order for scoring to
proceed, and you have to know the total size of the collection and
the number of docs the term appears in to calculate it. If a search
requested 10 docs, each machine would produce 10, then the top 10
scorers out of the aggregate group would be your final search
results. If you want to see how this can work, take a look at
Lucene's MultiSearcher class.
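
To make that concrete, here's a naive scatter/gather sketch using stock
Searcher objects -- the paths and query are placeholders, and since it
skips the shared-IDF step described above, scores from different
machines aren't strictly comparable; it only illustrates the final
merge:

    use KinoSearch::Searcher;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
    my @invindex_paths = ( '/node1/invindex', '/node2/invindex' );
    my $query_string   = 'some query';

    my @pooled;
    for my $path (@invindex_paths) {
        my $searcher = KinoSearch::Searcher->new(
            invindex => $path,
            analyzer => $analyzer,
        );
        my $hits = $searcher->search( query => $query_string );
        $hits->seek( 0, 10 );    # top 10 from this machine
        while ( my $hit = $hits->fetch_hit_hashref ) {
            push @pooled, $hit;
        }
    }

    # keep the 10 best scorers out of the aggregate group
    my @final = sort { $b->{score} <=> $a->{score} } @pooled;
    splice @final, 10 if @final > 10;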

Working up such a cluster system is not currently at the top of my to-
do list. (Finding steady work is, followed by FieldCache, Lucy, etc...)

> to index with approx 16 new updates a second, which we'd like to merge
> within about 40 seconds, so any help would be greatly appreciated.

The only way you're going to be able to manage this is if a new
option gets added to finish(), and even then it will be challenging.
This would also hold true for Lucene, because the problem is the
same: inverted index data structures are optimized for fast
searching, not instant updates.

Merging of segments takes an unpredictable amount of time. Usually
it will be pretty quick, but every once in a while, when large
segments get merged, it will take a while.

Furthermore, when you update an index, you need to open a new
Searcher and warm up its caches. That cost is much easier to
predict, though, as the warm-up time tracks the size of the index
closely.
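
Something along these lines, where the invindex path and the warm-up
query are just illustrative (which warm-up actually matters depends on
which caches your queries hit):

    use KinoSearch::Searcher;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

    # open a Searcher against the freshly updated invindex...
    my $new_searcher = KinoSearch::Searcher->new(
        invindex => '/path/to/invindex',
        analyzer => $analyzer,
    );

    # ...and warm it up with a representative query before swapping it in
    my $warmup = $new_searcher->search( query => 'a typical query' );
    $warmup->seek( 0, 10 );
    while ( my $hit = $warmup->fetch_hit_hashref ) { }    # force the reads

    my $live_searcher = $new_searcher;    # serve searches from this one now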

Incremental indexing in KS/Lucene works something like this:

Add a book, generate an index.

Add another chapter, generate a second index. Now you need to
leaf through two indexes.

Add another chapter, generate a third index. Now you've got
three indexes to look through each time you search.

...

Those indexes get merged periodically, and the interleaving process
takes time proportional to the size of the segments being merged.
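
In code, each incremental pass looks roughly like this -- the field
names and sample docs are placeholders:

    use KinoSearch::InvIndexer;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

    my @update_batches = (
        [ { title => 'Chapter 1', body => 'Call me Ishmael...' } ],
        [ { title => 'Chapter 2', body => 'Some years ago...' } ],
    );

    for my $batch (@update_batches) {
        # pass create => 1 on the very first run to create the invindex
        my $invindexer = KinoSearch::InvIndexer->new(
            invindex => '/path/to/invindex',
            analyzer => $analyzer,
        );
        $invindexer->spec_field( name => 'title' );
        $invindexer->spec_field( name => 'body' );
        $invindexer->add_doc($_) for @$batch;
        $invindexer->finish;    # writes another segment; may trigger a merge
    }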

The more segments you have, the longer it takes to search, though the
time is still much more dependent on the absolute size of the index than
on the number of segments. Optimizing the index via...

$invindexer->finish( optimize => 1 );

... reads all existing segments and dumps them into the one currently
being written, which is costly at index-time but produces a single-
segment index for fastest searching. Ordinarily, KS tries to keep
segments around in something approximating a Fibonacci series (as an
aside, this is different from Lucene) -- the idea being to minimize
the number of merges, while still providing good search-time
performance.

What we would need to do is STOP KinoSearch from merging any existing
segments into the segment currently in progress (that's the API
that's missing). Then every once in a while, you'd force-optimize
the index -- probably off-line to a copy which gets swapped in when
the process completes. It's a reasonable strategy, but the number of
updates in your spec is quite substantial.
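
A rough sketch of that optimize-offline-and-swap cycle, with placeholder
paths and a plain copy standing in for whatever mechanism you'd actually
use:

    use KinoSearch::InvIndexer;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

    # copy the live invindex aside ('cp -a' is just one way to do this)
    system( 'cp', '-a', '/search/live_invindex',
        '/search/optimizing_invindex' ) == 0
        or die "copy failed";

    # force-optimize the copy down to a single segment
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/search/optimizing_invindex',
        analyzer => $analyzer,
    );
    $invindexer->finish( optimize => 1 );

    # swap the optimized copy in; rename is atomic on one filesystem
    rename '/search/live_invindex', '/search/retired_invindex'
        or die "rename failed: $!";
    rename '/search/optimizing_invindex', '/search/live_invindex'
        or die "rename failed: $!";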

Cheers,

Marvin Humphrey

--
I'm looking for a part time job.