I was wondering if there are tricks for making indexing faster in
Lucene. I have a program which reads XML "documents" from a file, and
indexes the 7 or so fields which occur in them. Most of the fields are
very short, and the one long one averages a few hundred words.
To index 20000 such records takes 615 seconds. I use an IndexWriter with
a String as the first argument, i.e. indexing directly to disc. If I
change the mergeFactor to 100, the time drops to 275 seconds. At 1000,
it drops to 249 seconds. These times are not bad in absolute terms, but
the 20000 records represent only about 2% of my data, so indexing the whole
lot takes many hours. Using java -Xprof and mergeFactor=10, the biggest
consumers of processing time are:
22.2% 5 + 13172 java.io.RandomAccessFile.open
16.1% 4 + 9567 java.io.RandomAccessFile.close
13.3% 4 + 7880 java.io.RandomAccessFile.readBytes
8.1% 5 + 4818 java.io.RandomAccessFile.writeBytes
7.2% 4293 + 9 org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
5.8% 5 + 3426 java.io.Win32FileSystem.delete
I believe all of these are calls from Lucene as I don't use any of the
above methods in my own code. readBytes and writeBytes I can believe,
but why so much time on open and close? Incidentally, with
mergeFactor=1000, the biggest consumers are:
29.7% 0 + 6729 java.io.RandomAccessFile.readBytes
19.0% 4296 + 12 org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
As a point of comparison, I tried AltaVista's Java SDK (Nov 2000
release). I have a generic indexer program which differs only in the
specific indexing calls for AV and Lucene. For the same 20000 records,
it took only 57 seconds. This, I feel, does not sit well with Doug's
comment in the Lucene FAQ that indexing in Lucene is very fast. If
anyone has ideas for making it faster, I'd be interested to hear them.
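
To put "many hours" in rough numbers, here is a small sketch that scales the sample timings above up to the full data set. It assumes indexing time grows linearly with the number of records, which may not hold once segment merges get larger, and it treats "about 2%" as exactly 0.02. The class and method names are made up for this illustration and are not part of Lucene or the AltaVista SDK.

```java
public class IndexTimeEstimate {

    /**
     * Projects full-corpus indexing time in hours from a timing on a sample.
     * Assumption: time scales linearly with record count.
     */
    static double projectedHours(double sampleSeconds, double sampleFraction) {
        if (sampleFraction <= 0.0 || sampleFraction > 1.0) {
            throw new IllegalArgumentException("sampleFraction must be in (0, 1]");
        }
        return sampleSeconds / sampleFraction / 3600.0;
    }

    public static void main(String[] args) {
        double fraction = 0.02; // the 20000 records are about 2% of the data

        // Timings reported above for the 20000-record sample, in seconds.
        String[] labels = {
            "Lucene, default", "Lucene, mergeFactor=100",
            "Lucene, mergeFactor=1000", "AltaVista SDK"
        };
        double[] seconds = {615, 275, 249, 57};

        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-26s %4.0f s sample -> %5.1f h projected%n",
                    labels[i], seconds[i], projectedHours(seconds[i], fraction));
        }
    }
}
```

The linear model is only a first guess. If merge cost grows faster than linearly with index size, the real Lucene figures would come out higher than this projection.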
-- David Elworthy
--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>