Mailing List Archive

File Handles issue
We're having a heck of a time with too many file handles around here. When
we create large indexes, we often get thousands of temporary files in a
given index! Even worse, we just plain run out of file handles--even on
boxes where we've upped the limits as much as we think we can! We've played
around with various settings for the mergeFactor and maxMergeDocs, but these
seem to have at best an indirect effect on the number of temporary files
created.

I'm not very familiar with the Lucene file system yet, so can someone
briefly explain how Lucene goes about creating an index? How does it
decide when to create a new temporary file in the index, and when does it
decide to compact the index? Also, is there any way we could limit the
number of file handles used by Lucene?

This is becoming a huge problem for us, so any insight would be appreciated.

Thanks,
Scott
RE: File Handles issue
> From: Scott Ganyo [mailto:scott.ganyo@eTapestry.com]
>
> We're having a heck of a time with too many file handles
> around here. When
> we create large indexes, we often get thousands of temporary
> files in a given index!

Thousands, eh? That seems high.

The maximum number of segments should be f*log_f(N), where f is the
IndexWriter.mergeFactor and N is the number of documents. The default merge
factor is ten. There are seven files per segment, plus one per field. If
we assume that you have three fields per document, that's ten files per
segment. So to get 1,000 files in an index with three fields and a
mergeFactor of ten, you'd need 10 billion documents, which I doubt you have.
(Lucene can't handle more than 2 billion anyway...)

How many fields do you have? (How many different .f files are there per
segment?)

Have you lowered IndexWriter.maxMergeDocs? If, for example, you lowered this
to 10,000, then with a million documents you'd have 100 segments, which would
give you 1,000 files. So, to minimize the number of files, keep maxMergeDocs
at its default, Integer.MAX_VALUE.
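
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch
in plain Java (no Lucene calls; the constants, seven files per segment plus
one .f file per field, are just the approximations above, and the inputs are
made-up examples):

public class FileCountEstimate
{
    public static void main(String[] args)
    {
        int mergeFactor = 10;      // IndexWriter.mergeFactor (the default)
        int fields = 3;            // distinct indexed fields
        long numDocs = 1000000L;   // documents in the index

        // roughly seven fixed files per segment, plus one .f file per field
        int filesPerSegment = 7 + fields;

        // default maxMergeDocs: at most f * log_f(N) segments
        double segments = mergeFactor * (Math.log(numDocs) / Math.log(mergeFactor));
        System.out.println("default settings: ~" + Math.round(segments * filesPerSegment) + " files");

        // maxMergeDocs lowered to 10,000: one full-size segment per 10,000 documents
        long cappedSegments = numDocs / 10000;
        System.out.println("maxMergeDocs=10000: ~" + (cappedSegments * filesPerSegment) + " files");
    }
}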

Another possibility is that you're running on Win32 and obsolete files are
being kept open by IndexReaders and cannot be deleted. Could that be the
case?

> Even worse, we just plain run out of file
> handles--even on
> boxes where we've upped the limits as much as we think we
> can!

You should endeavour to keep just one open IndexReader at a time per index.
When it is out of date, don't close it, as that could break queries running
in other threads; just let it get garbage collected. The finalizers will
close things and free the file handles.
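
For what it's worth, here is a minimal sketch of that pattern (ReaderHolder
and its method names are made up for illustration; the only point is that
refresh() replaces the reference instead of closing the stale reader):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Hypothetical holder that keeps a single IndexReader per index.  When the
// index changes we open a fresh reader and simply drop the old reference;
// searches already running against the old reader keep working, and its
// finalizer closes the files once it is garbage collected.
public class ReaderHolder
{
    private final String indexPath;
    private volatile IndexReader reader;

    public ReaderHolder(String indexPath) throws IOException
    {
        this.indexPath = indexPath;
        this.reader = IndexReader.open(indexPath);
    }

    public IndexReader getReader()
    {
        return reader;
    }

    // call after the index has been updated
    public void refresh() throws IOException
    {
        reader = IndexReader.open(indexPath);  // deliberately do not close the old reader
    }
}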

> I'm not very familiar with the Lucene file system yet, so can someone
> briefly explain how Lucene works on creating an index? How does it
> determine when to create a new temporary file in the index
> and when does it
> decide to compress the index?

Assume mergeFactor is ten, the default. A new segment is created on disk
for every ten documents added, or sooner if IndexWriter.close() is called
before ten have been added. When the tenth segment of size ten is added,
all ten are merged into a single segment of size 100. When ten such
segments of size 100 have been added, these are merged into a single segment
containing 1000 documents, and so on. So at any time there can be no more
than nine segments in each power-of-ten index size. When optimize() is
called all segments are merged into a single segment.

The exception is that no segments will be created larger than
IndexWriter.maxMergeDocs. So if this were set to 1000, then when you add
the 10,000th document, instead of merging things into a single segment of
10,000, it would add a tenth segment of size 1000, and keep adding segments
of size 1000 for every 1000 documents added.
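
To make that concrete, here is a minimal sketch of setting those two knobs
while building an index, assuming the public mergeFactor/maxMergeDocs fields
on IndexWriter in this version of Lucene (the index path and field name are
just examples):

import java.io.File;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TunedIndexer
{
    public static void main(String[] args) throws Exception
    {
        IndexWriter writer = new IndexWriter(new File("index"), new SimpleAnalyzer(), true);

        // as described above: a new segment is written every mergeFactor documents,
        // and segments are merged whenever mergeFactor of the same size accumulate
        writer.mergeFactor = 10;                  // the default

        // cap on merged segment size; keep the default to minimize the segment count
        writer.maxMergeDocs = Integer.MAX_VALUE;  // the default

        for (int i = 0; i < 1000; i++)
        {
            Document doc = new Document();
            doc.add(Field.Keyword("id", "" + i));
            writer.addDocument(doc);
        }

        writer.optimize();  // merge everything down to a single segment
        writer.close();
    }
}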

> Also, is there any way we
> could limit the
> number of file handles used by Lucene?

An IndexReader keeps all files in all segments open while it is open. So to
minimize the number of file handles you should minimize the number of
segments, minimize the number of fields, and minimize the number of
IndexReaders open at once.

An IndexWriter also has all files in all segments open at once. So updating
in a separate process would also buy you more file handles.

Doug
RE: File Handles issue
Thanks for the detailed information, Doug! That helps a lot.

Based on what you've said and on taking a closer look at the code, it looks
like by setting mergeFactor and maxMergeDocs to Integer.MAX_VALUE, an entire
index will be built in a single segment completely in memory (using the
RAMDirectory) and then flushed to disk when closed. Given enough memory, it
would seem that this would be the fastest setting (as well as using a
minimum of file handles). Would you agree?

Thanks,
Scott

P.S. At one point I tried doing an in-memory index using the RAMDirectory
and then merging it with an on-disk index and it didn't work. The
RAMDirectory never flushed to disk... leaving me with an empty index. I
think this is because of a bug in the mechanism that is supposed to copy the
segments during the merge, but I didn't follow up on this.
RE: File Handles issue
> From: Scott Ganyo [mailto:scott.ganyo@eTapestry.com]
>
> Thanks for the detailed information, Doug! That helps a lot.
>
> Based on what you've said and on taking a closer look at the
> code, it looks
> like by setting mergeFactor and maxMergeDocs to
> Integer.MAX_VALUE, an entire
> index will be built in a single segment completely in memory
> (using the
> RAMDirectory) and then flushed to disk when closed.

Not quite. This would generate an index with a segment per document in
memory, and then try to merge them all in a single step. That should work,
but I do not think it is the most efficient way to build an index in memory.

> P.S. At one point I tried doing an in-memory index using the
> RAMDirectory
> and then merging it with an on-disk index and it didn't work. The
> RAMDirectory never flushed to disk... leaving me with an
> empty index. I
> think this is because of a bug in the mechanism that is
> supposed to copy the
> segments during the merge, but I didn't follow up on this.

That should work; it should be faster and would use a lot less memory than
the approach you describe above. Can you please submit a simple test case
illustrating the failure? Something self-contained would be best.

Doug
RE: File Handles issue
> > P.S. At one point I tried doing an in-memory index using the
> > RAMDirectory
> > and then merging it with an on-disk index and it didn't work. The
> > RAMDirectory never flushed to disk... leaving me with an
> > empty index. I
> > think this is because of a bug in the mechanism that is
> > supposed to copy the
> > segments during the merge, but I didn't follow up on this.
>
> That should work, it should be faster and would use a lot
> less memory than
> the approach you describe above. Can you please submit a
> simple test case
> illustrating the failure? Something self-contained would be best.

Ok. This will fail:

import java.io.*;
import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.store.*;

public class LuceneRAMDirectoryTest
{
    public static void main(String args[])
    {
        try
        {
            // create index in RAM
            RAMDirectory ramDirectory = new RAMDirectory();
            Analyzer analyzer = new SimpleAnalyzer();
            IndexWriter ramWriter = new IndexWriter(ramDirectory, analyzer, true);
            try
            {
                for (int i = 0; i < 100; i++)
                {
                    Document doc = new Document();
                    doc.add(Field.Keyword("field1", "" + i));
                    ramWriter.addDocument(doc);
                }
            }
            finally
            {
                ramWriter.close();
            }

            // then merge into file
            File file = new File("index");
            boolean missing = !file.exists();
            if (missing) file.mkdir();
            IndexWriter fileWriter = new IndexWriter(file, analyzer, true);
            try
            {
                fileWriter.addIndexes(new Directory[] { ramDirectory });
            }
            finally
            {
                fileWriter.close();
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}