Re: batch indexing
[ I've moved this discussion to lucene-dev. -drc ]

Dmitry Serebrennikov wrote:
> I was just thinking about doing something similar, but after looking at
> your code I wondered: couldn't the same thing be done by manipulating
> the mergeFactor on the existing IndexWriter? It already indexes n
> documents into memory before writing a new disk segment. I just looked
> at it again, but I can't tell without a detailed study whether the
> mergeFactor applies only to merging from RAM to disk or to merging
> on-disk segments as well. If it applies to both, perhaps we could add
> a separate field to the IndexWriter to allow the two values to differ?
> Am I missing something?

There'd be more to it than manipulating mergeFactor, but you're right:
IndexWriter already buffers indexes in a RAMDirectory. And it would be
nice to use that RAMDirectory more extensively, controlled separately
from mergeFactor. Peter achieves this by using two IndexWriters, but
this should probably be built into IndexWriter.

Also, it would be better to limit things not by the number of documents
indexed in RAM, but by the amount of memory used. RAMDirectory could be
altered to keep track of how big it is. We could, by default, allow up
to, say, 10MB of index to be buffered before things are flushed. This
would speed up batch indexing and reduce the number of files it
creates.
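
For illustration, the addDocument path might then look roughly like
this (maxBufferedBytes, sizeInBytes(), and the two helper methods are
all hypothetical names, not existing API):

  // Sketch only: flush the buffered index once it exceeds a byte limit.
  public void addDocument(Document doc) throws IOException {
    addDocumentToRam(doc);                              // index doc into ramDirectory
    if (ramDirectory.sizeInBytes() > maxBufferedBytes)  // e.g. 10MB
      flushRamSegments();                               // merge RAM segments onto disk
  }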

The changes to RAMDirectory should be straightforward. New buffers are
only allocated in the flushBuffer implementation, and they're only freed
by deleteFile and renameFile.
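
A sketch of that bookkeeping, assuming a new sizeInBytes field on
RAMDirectory (the existing implementation details are elided):

  // Sketch: keep a running byte count as buffers come and go.
  public void flushBuffer(byte[] src, int len) {
    sizeInBytes += len;                 // new buffer space allocated here
    // ... existing code that copies src into a fresh buffer ...
  }

  public void deleteFile(String name) {
    sizeInBytes -= fileLength(name);    // reclaim the deleted file's space
    // ... existing removal of the file ...
  }
  // renameFile would likewise subtract the size of any file it replaces.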

The changes to IndexWriter are more complicated. The simplest approach
would be to just buffer all single-document indexes in memory and merge
them all at once when flushing. However, merging thousands of documents
at once would use a lot of memory, and is probably not acceptable. So
there probably needs to be two stacks of segment indexes, one for those
in memory and one for those on disk. (The latter is what is contained
in the "segments" file.)

Both could use an algorithm like the one the current stack uses. Each
time a new document is added, its index is pushed onto the stack and
merging is attempted. When segments are merged, the old indexes are
popped off the top of the stack and the new, combined index replaces
them on top. With mergeFactor=10, for example, every tenth
single-document segment triggers a merge into a ten-document segment,
every tenth ten-document segment a merge into a hundred-document
segment, and so on.

The current algorithm to figure out when to merge is roughly:

  int n = 1;
  while (stack[top] through stack[top-mergeFactor]
         all contain n or fewer documents) {
    merge(stack[top] through stack[top-mergeFactor]);
    n *= mergeFactor;
  }

The new case that two stacks would introduce is when the in-memory
index is flushed to disk as a single segment. This could put a segment
on top of the disk-based stack that is larger than those beneath it,
something that can't currently happen and that breaks the above
algorithm. I think the fix is simple, though:

  int n = stack[top].numDocs;
  while (stack[top] through stack[top-mergeFactor]
         all contain n or fewer documents) {
    merge(stack[top] through stack[top-mergeFactor]);
    n *= mergeFactor;
  }
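
To check the logic, here's a small self-contained Java model of one
stack, with a segment reduced to its document count and "stack[top]
through stack[top-mergeFactor]" read as the top mergeFactor segments
(everything here is illustrative, not Lucene code):

  import java.util.ArrayList;
  import java.util.List;

  // Toy model: a segment is represented only by its document count.
  class SegmentStack {
    private final List<Integer> docCounts = new ArrayList<Integer>();
    private final int mergeFactor;

    SegmentStack(int mergeFactor) { this.mergeFactor = mergeFactor; }

    // Push a new segment (a one-doc index, or a flushed RAM segment).
    void push(int numDocs) {
      docCounts.add(numDocs);
      maybeMerge();
    }

    // Seed n from the top segment's size instead of 1, so a large
    // segment flushed from RAM still lines up with its neighbors.
    private void maybeMerge() {
      long n = docCounts.get(docCounts.size() - 1);
      while (topSegmentsAllContainAtMost(n)) {
        mergeTopSegments();
        n *= mergeFactor;
      }
    }

    // True if the top mergeFactor segments each hold n or fewer docs.
    private boolean topSegmentsAllContainAtMost(long n) {
      if (docCounts.size() < mergeFactor)
        return false;
      for (int i = docCounts.size() - mergeFactor; i < docCounts.size(); i++)
        if (docCounts.get(i) > n)
          return false;
      return true;
    }

    // Replace the top mergeFactor segments with one combined segment.
    private void mergeTopSegments() {
      int merged = 0;
      for (int i = 0; i < mergeFactor; i++)
        merged += docCounts.remove(docCounts.size() - 1);
      docCounts.add(merged);
    }
  }

Under the old n = 1 seed, a stack built from flushed multi-document
segments would never pass the "n or fewer" test and so would never
merge; seeding n from the top segment's size restores the cascade.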

Does that look right? The goals are to:
- keep only O(log(N)) segments, so searching is fast
- only merge each document O(log(N)) times, so indexing is fast

Dmitry, are you interested in working on this?

Doug



Re: batch indexing
Doug Cutting wrote:

> [...]
>
> Does that look right? The goals are to:
> - keep only O(log(N)) segments, so searching is fast
> - only merge each document O(log(N)) times, so indexing is fast
>
> Dmitry, are you interested in working on this?

Well, frankly, I didn't realize how complicated this was... Yes, I am
interested in working on this and it would be a great addition to
Lucene, but there are other things in my work queue way ahead of this
one. Peter's solution suddenly looks much more appealing :)

Dmitry.

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>