[ I've moved this discussion to lucene-dev. -drc ]
Dmitry Serebrennikov wrote:
> I was just thinking about doing something similar, but after looking at
> your code I thought: couldn't the same thing be done by manipulating the
> mergeFactor on the existing IndexWriter? It already indexes n documents
> into memory before writing a new disk segment. I just looked at it again
> but I can't see without a detailed study whether the mergeFactor applies
> to merging from RAM to disk only or for merging on-disk segments as
> well. If it applies to both, perhaps we could add a different field to
> the IndexWriter to allow the two values to be different? Am I missing
> something?
There'd be more to it than manipulating mergeFactor, but you're right,
IndexWriter already buffers indexes in a RAMDirectory. And it would be
nice to use that RAMDirectory more extensively, controlled separately
from mergeFactor. Peter achieves this by using two IndexWriters, but
probably this should be built into IndexWriter.
Also, it would be better to limit things not based on the number of
documents indexed in RAM, but on the amount of memory used.
RAMDirectory could be altered to keep track of how big it is. We could,
by default, allow buffering of up to, say, 10MB of index before things
are flushed. This would speed up batch indexing and reduce the number
of files it creates.
The changes to RAMDirectory should be straightforward. New buffers are
only allocated in the flushBuffer implementation, and they're only freed
by deleteFile and renameFile.
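As a rough sketch of the bookkeeping (toy names, not Lucene's actual
classes -- ToyRAMDirectory and FLUSH_THRESHOLD are illustrative only):

```java
// Toy sketch: a RAM "directory" that tracks its total buffered size,
// so a flush can be triggered by bytes used rather than document count.
// This is not Lucene's RAMDirectory; it only models the size accounting.
import java.util.HashMap;
import java.util.Map;

class ToyRAMDirectory {
    static final long FLUSH_THRESHOLD = 10L * 1024 * 1024; // e.g. 10MB

    private final Map<String, byte[]> files = new HashMap<>();
    private long sizeInBytes = 0;

    // Buffer allocation: adjust the running total on every write.
    void writeFile(String name, byte[] data) {
        byte[] old = files.put(name, data);
        if (old != null) sizeInBytes -= old.length;
        sizeInBytes += data.length;
    }

    // Buffer release: subtract the freed file's size.
    void deleteFile(String name) {
        byte[] old = files.remove(name);
        if (old != null) sizeInBytes -= old.length;
    }

    long sizeInBytes() { return sizeInBytes; }

    boolean shouldFlush() { return sizeInBytes >= FLUSH_THRESHOLD; }
}
```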
The changes to IndexWriter are more complicated. The simplest approach
would be to just buffer all single-document indexes in memory and merge
them all at once when flushing. However, merging thousands of documents
at once would use a lot of memory, and is probably not acceptable. So
there probably needs to be two stacks of segment indexes, one for those
in memory and one for those on disk. (The latter is what is contained
in the "segments" file.)
Both could use an algorithm like that used by the current stack. Each
time a new document is added, its index is pushed onto a stack, and
merging is attempted. When segments are merged, the old indexes are
removed from the top of the stack, and the new index replaces them on
the top of the stack.
The current algorithm to figure out when to merge is roughly:
  int n = 1;
  while (stack[top] through stack[top-mergeFactor]
         all contain n or fewer documents) {
    merge(stack[top] through stack[top-mergeFactor]);
    n *= mergeFactor;
  }
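To make the behavior concrete, here's a toy simulation of that check,
representing each segment by its document count on a stack. It assumes
"stack[top] through stack[top-mergeFactor]" means the top mergeFactor
segments; the class and method names are invented for illustration, not
IndexWriter's actual code:

```java
// Toy model of the current merge policy: push a one-document segment,
// then repeatedly merge the top `mergeFactor` segments while they all
// contain n or fewer documents, multiplying n by mergeFactor each round.
import java.util.ArrayDeque;
import java.util.Deque;

class MergeSim {
    // Add one single-document segment, then merge while the policy allows.
    static void addDoc(Deque<Integer> stack, int mergeFactor) {
        stack.push(1);
        long n = 1;
        while (topAllAtMost(stack, mergeFactor, n)) {
            int merged = 0;
            for (int i = 0; i < mergeFactor; i++) merged += stack.pop();
            stack.push(merged);
            n *= mergeFactor;
        }
    }

    // True if the stack holds at least `count` segments and the top
    // `count` of them each contain at most n documents.
    static boolean topAllAtMost(Deque<Integer> stack, int count, long n) {
        if (stack.size() < count) return false;
        int seen = 0;
        for (int docs : stack) {          // iterates from the top down
            if (docs > n) return false;
            if (++seen == count) break;
        }
        return true;
    }
}
```

With mergeFactor 10, adding 100 documents one at a time leaves a single
100-document segment: every tenth add cascades a merge, so the stack
never holds more than a logarithmic number of segments.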
The new case that would occur with two stacks is when the in-memory
index is flushed to disk as a single segment. This could put a segment
on top of the disk-based stack that's larger than those beneath it,
which can't currently happen, and breaks the above algorithm. I think
the fix is simple though:
  int n = stack[top].numDocs;
  while (stack[top] through stack[top-mergeFactor]
         all contain n or fewer documents) {
    merge(stack[top] through stack[top-mergeFactor]);
    n *= mergeFactor;
  }
Does that look right? The goals are to:
- keep only O(log(N)) segments, so searching is fast
- only merge each document O(log(N)) times, so indexing is fast
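Here's the same toy model with the proposed fix, showing why starting n
at the top segment's size matters. Again this assumes the top
mergeFactor segments are the candidates, and all names are invented for
illustration:

```java
// Toy model of the fixed policy: when a segment larger than those
// beneath it lands on the stack (e.g. flushed from RAM), starting
// n at its size lets it merge with its smaller neighbors. With the
// old "n = 1" start, such a segment would never satisfy the check.
import java.util.ArrayDeque;
import java.util.Deque;

class FlushMergeSim {
    // Push a segment of `docs` documents, then merge per the fixed policy.
    static void addSegment(Deque<Integer> stack, int docs, int mergeFactor) {
        stack.push(docs);
        long n = stack.peek();            // n = stack[top].numDocs
        while (topAllAtMost(stack, mergeFactor, n)) {
            int merged = 0;
            for (int i = 0; i < mergeFactor; i++) merged += stack.pop();
            stack.push(merged);
            n *= mergeFactor;
        }
    }

    static boolean topAllAtMost(Deque<Integer> stack, int count, long n) {
        if (stack.size() < count) return false;
        int seen = 0;
        for (int docs : stack) {          // iterates from the top down
            if (docs > n) return false;
            if (++seen == count) break;
        }
        return true;
    }
}
```

For example, with mergeFactor 3 and two 3-document segments on disk,
flushing a 7-document segment gives n = 7, so all three segments merge
into one 13-document segment; the old initialization (n = 1) would have
left all three unmerged.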
Dmitry, are you interested in working on this?
Doug
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>