Mailing List Archive: Speed of indexing

I was wondering if there are tricks for making indexing faster in
Lucene. I have a program which reads XML "documents" from a file, and
indexes the 7 or so fields which occur in them. Most of the fields are
very short, and the one long one averages a few hundred words.

To index 20000 such records takes 615 seconds. I use an IndexWriter with
a String as the first argument, i.e. indexing directly to disc. If I
change the mergeFactor to 100, the time drops to 275 seconds. At 1000,
it drops to 249s. These times are not bad in absolute terms, but the
20000 records represents only about 2% of my data, so indexing the whole
lot takes many hours. Using java -Xprof and mergeFactor=10, the biggest
consumers of processing time are:
22.2% 5 + 13172 java.io.RandomAccessFile.open
16.1% 4 + 9567 java.io.RandomAccessFile.close
13.3% 4 + 7880 java.io.RandomAccessFile.readBytes
8.1% 5 + 4818 java.io.RandomAccessFile.writeBytes
7.2% 4293 + 9 org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
5.8% 5 + 3426 java.io.Win32FileSystem.delete

I believe all of these are calls from Lucene as I don't use any of the
above methods in my own code. readBytes and writeBytes I can believe,
but why so much time on open and close? Incidentally with
mergeFactor=1000, the biggest consumers are
29.7% 0 + 6729 java.io.RandomAccessFile.readBytes
19.0% 4296 + 12 org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0
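
One plausible reading of the open/close numbers: Lucene merges segments whenever mergeFactor of them have accumulated, and every merge has to open, read, close and then delete the files of the segments it consumes. The sketch below is only a simplified model of that cascade, not Lucene's actual code. It assumes each added document starts as its own segment and that mergeFactor segments at one level merge into one segment at the next level; the class and method names are made up for illustration.

```java
class MergeModel {
    /**
     * Counts merge operations in a simplified logarithmic-merge model.
     * Assumption: every document starts as its own level-0 segment, and
     * mergeFactor segments at one level are merged into one segment at
     * the next level. This is a model, not Lucene's implementation.
     */
    static long countMerges(int docs, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        int[] segmentsAtLevel = new int[32]; // enough levels for any int document count
        long merges = 0;
        for (int d = 0; d < docs; d++) {
            segmentsAtLevel[0]++;
            int level = 0;
            // Cascade: a full level collapses into one segment on the level above.
            while (segmentsAtLevel[level] == mergeFactor) {
                segmentsAtLevel[level] = 0;
                segmentsAtLevel[level + 1]++;
                merges++;
                level++;
            }
        }
        return merges;
    }

    public static void main(String[] args) {
        int docs = 20000; // the sample size quoted in the message
        for (int mergeFactor : new int[] {10, 100, 1000}) {
            System.out.println("mergeFactor=" + mergeFactor
                    + " merges=" + countMerges(docs, mergeFactor));
        }
    }
}
```

For 20000 documents the model gives 2222 merges at mergeFactor=10, 202 at 100 and 20 at 1000. It ignores how large each merge is, so it says nothing about total bytes read or written, but it is at least consistent with open, close and delete disappearing from the top of the profile at mergeFactor=1000.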

As a point of comparison, I tried AltaVista's Java SDK (Nov 2000
release). I have a generic indexer program which differs only in the
specific indexing calls for AV and Lucene. For the same 20000 records,
it took only 57 seconds. This, I feel, does not speak well to Doug's
comment in the Lucene FAQ that indexing in Lucene is very fast. If
anyone has ideas for making it faster, I'd be interested to hear them.
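
To put "many hours" into numbers, here is a rough extrapolation from the timings above. It assumes indexing time grows linearly with the number of records, which is only an approximation, and takes "about 2%" as exactly 2%. The class and method names are made up for illustration.

```java
class IndexTimeEstimate {
    /** Scales a time measured on a sample up to the full data set, assuming linear scaling. */
    static double estimateFullSeconds(double sampleSeconds, double samplePercent) {
        return sampleSeconds * (100.0 / samplePercent);
    }

    public static void main(String[] args) {
        double samplePercent = 2.0; // "about 2%" of the data, taken as exact here
        String[] labels = {
            "Lucene, default mergeFactor",
            "Lucene, mergeFactor=100",
            "Lucene, mergeFactor=1000",
            "AltaVista SDK"
        };
        double[] sampleSeconds = {615, 275, 249, 57}; // timings for 20000 records
        for (int i = 0; i < labels.length; i++) {
            double total = estimateFullSeconds(sampleSeconds[i], samplePercent);
            System.out.printf("%-28s %8.0f s  (%.1f h)%n", labels[i], total, total / 3600.0);
        }
    }
}
```

Under those assumptions the full data set comes to roughly 8.5 hours at the default setting, about 3.8 and 3.5 hours at mergeFactor 100 and 1000, and under an hour for the AltaVista SDK.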

-- David Elworthy

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

My experience is that writing to the index takes the most time, apart from
any parsing done by the user. I have been working on XML indexes, and there
collecting the data takes about as long as writing it. To increase speed I
have done three things, which reduced my indexing time from 11 hours to
2.5 hours for the same dataset (1.3 GB of XML documents).

1: I index 50 documents into a RAMDirectory (ramdir); when that limit is
reached I merge the ramdir into an FSDirectory (fsdir) and flush the ramdir.
This speeds things up because I use the fsdir far less, and the ramdir is
much faster.

2: Merging a large index into a large index takes nearly as much time as
merging a small index into a large index, so I have 4 (any number will do)
fsdirs that I write ramdirs to, and I merge these fsdirs into one large
fsdir at the end of a large index run.

3: I multithreaded my application: each worker thread indexes into its own
separate ramdir, then flushes that ramdir into its own separate fsdir
(hence I have one fsdir per worker thread). This is because only one thread
can write to a dir at a time.
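
The three steps above can be sketched roughly as follows. This is not Lucene code: plain Java lists stand in for the ramdir and fsdir, and every name in it (BatchedIndexingSketch, Worker, indexAll) is hypothetical. It only shows the shape of the pattern: buffer 50 documents in memory, flush them in bulk to a per-thread store, and merge the per-thread stores once at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchedIndexingSketch {
    static final int BATCH_SIZE = 50; // documents buffered in memory before each flush

    /** Hypothetical stand-in for one worker's ramdir/fsdir pair. */
    static class Worker {
        final List<String> ramBuffer = new ArrayList<>(); // plays the role of the ramdir
        final List<String> diskIndex = new ArrayList<>(); // plays the role of this worker's fsdir
        int flushes = 0;

        void add(String doc) {
            ramBuffer.add(doc);
            if (ramBuffer.size() >= BATCH_SIZE) {
                flush();
            }
        }

        void flush() {
            if (ramBuffer.isEmpty()) {
                return;
            }
            diskIndex.addAll(ramBuffer); // one bulk write instead of one write per document
            ramBuffer.clear();
            flushes++;
        }
    }

    /** Splits docs across the workers, then merges the per-worker stores once at the end. */
    static List<String> indexAll(List<String> docs, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Worker>> results = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int offset = w;
                results.add(pool.submit(() -> {
                    Worker worker = new Worker(); // never shared between threads
                    for (int i = offset; i < docs.size(); i += workers) {
                        worker.add(docs.get(i));
                    }
                    worker.flush(); // write out the final partial batch
                    return worker;
                }));
            }
            List<String> merged = new ArrayList<>(); // the single large index built at the end
            for (Future<Worker> result : results) {
                merged.addAll(result.get().diskIndex);
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing worker failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In Lucene itself the flush and the final merge would presumably go through an IndexWriter opened on the target directory, for example with addIndexes; check the API of the version you are running before relying on that.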

In the end this improved my indexing time a lot.

Hope some of this can help you!

Regards, Karl Øie

On Monday 25 March 2002 14:08, you wrote:
> [original message snipped; it appears in full above]
