Mailing List Archive

Strange behavior indexing 1000 documents in RAMDirectory
While comparing RAMDirectory vs. FSDirectory indexing performance, I ran across some strange behavior. If I add exactly 1000 documents using the RAMDirectory and then write the index to disk, the search fails to find any documents. However, running the exact same test with any other number of documents seems to work fine.

I have attached a test case that indexes 500, 999, 1000, 1001, 2000, and 5000 documents in both a RAMDirectory and an FSDirectory and then searches for them. Notice in the following test output that the RAMDirectory test using 1000 documents fails to find any documents; all other tests work fine.

D:\Dev\Test Tools\lucene>java -cp %CLASSPATH%;.\ RAMWriterTest

Index and search 500 documents
RAMDirectory: indexed 500 in 1656 msec
RAMDirectory indexing: search 500 in 125 msec
FSDirectory: indexed 500 in 4157 msec
FSDirectory indexing: search 500 in 78 msec

Index and search 999 documents
RAMDirectory: indexed 999 in 1953 msec
RAMDirectory indexing: search 999 in 31 msec
FSDirectory: indexed 999 in 7797 msec
FSDirectory indexing: search 999 in 235 msec

Index and search 1000 documents
RAMDirectory: indexed 1000 in 1891 msec
RAMDirectory indexing: search 0 in 15 msec
FSDirectory: indexed 1000 in 8485 msec
FSDirectory indexing: search 1000 in 31 msec

Index and search 1001 documents
RAMDirectory: indexed 1001 in 1797 msec
RAMDirectory indexing: search 1001 in 16 msec
FSDirectory: indexed 1001 in 8484 msec
FSDirectory indexing: search 1001 in 32 msec

Index and search 2000 documents
RAMDirectory: indexed 2000 in 3594 msec
RAMDirectory indexing: search 2000 in 16 msec
FSDirectory: indexed 2000 in 16875 msec
FSDirectory indexing: search 2000 in 31 msec

Index and search 5000 documents
RAMDirectory: indexed 5000 in 8641 msec
RAMDirectory indexing: search 5000 in 31 msec
FSDirectory: indexed 5000 in 42375 msec
FSDirectory indexing: search 5000 in 78 msec


Can anybody explain this?

Thanks.
Paul
RE: Strange behavior indexing 1000 documents in RAMDirectory
Paul,

Thanks for the nice test case!

This bug was fixed a week or so ago. Try the latest nightly release from:
http://jakarta.apache.org/builds/jakarta-lucene/nightly/

Using that, I get the desired output:

Index and search 1000 documents
RAMDirectory: indexed 1000 in 1532 msec
RAMDirectory indexing: search 1000 in 0 msec
FSDirectory: indexed 1000 in 8622 msec
FSDirectory indexing: search 1000 in 0 msec

RAMDirectory is sure a lot faster! Looks like I should add an option to let
more of indexing automatically happen in a RAMDirectory...

Currently mergeFactor documents are indexed in RAM and then merged to disk.
It would be fairly easy to add a limit so that up to N documents could be
indexed in RAM before any are written to disk, where N is user-specified.
IndexWriter.close() would still flush RAM-based segments to disk. The
default should still probably be fairly low, in case folks are adding large
documents and don't have much RAM, but folks with RAM and small documents
could raise it.
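
The document-count limit described above can be sketched in plain Java. This is
only an illustration of the buffering policy, not Lucene code: the class
CountBufferedWriter, its methods, and the in-memory lists standing in for the
RAM-based segments and the on-disk index are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not Lucene's API): buffer documents in memory and
 * flush them to "disk" once a user-specified number N has accumulated.
 */
final class CountBufferedWriter {
    private final int maxBufferedDocs;                          // the user-specified N
    private final List<String> ramBuffer = new ArrayList<>();   // stands in for RAM-based segments
    private final List<String> disk = new ArrayList<>();        // stands in for the on-disk index
    private int flushCount = 0;

    CountBufferedWriter(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Adds a document to the buffer and flushes once N documents are buffered. */
    void addDocument(String doc) {
        ramBuffer.add(doc);
        if (ramBuffer.size() >= maxBufferedDocs) {
            flush();
        }
    }

    /** Flushes whatever is still buffered, mirroring the close() behavior described above. */
    void close() {
        flush();
    }

    private void flush() {
        if (ramBuffer.isEmpty()) {
            return; // nothing buffered, so this does not count as a flush
        }
        disk.addAll(ramBuffer);
        ramBuffer.clear();
        flushCount++;
    }

    int docsOnDisk() { return disk.size(); }

    int flushCount() { return flushCount; }

    public static void main(String[] args) {
        CountBufferedWriter writer = new CountBufferedWriter(100);
        for (int i = 0; i < 1000; i++) {
            writer.addDocument("doc " + i);
        }
        writer.close();
        System.out.println(writer.docsOnDisk() + " docs written in "
                + writer.flushCount() + " flushes");
    }
}
```

A low default for N keeps memory use bounded when documents are large; raising
it trades memory for fewer flushes.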

A better approach would be to have users specify the limit in bytes rather
than documents, and to flush the RAM-based segments when the RAM directory's
size reaches that limit. This would take a bit more work, but still
shouldn't be hard. Then you could dedicate, say, 10MB to indexing,
regardless of document size. Hmm...
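
A byte-based limit could look roughly like the following. Again, this is a
hypothetical sketch in plain Java rather than anything in Lucene: the class
ByteBudgetWriter and its methods are invented for illustration, and the size of
a document is approximated by the length of its UTF-8 encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not Lucene's API): buffer documents in memory and
 * flush when the buffered size reaches a user-specified byte budget.
 */
final class ByteBudgetWriter {
    private final long maxBufferedBytes;                        // the user-specified budget
    private final List<byte[]> ramBuffer = new ArrayList<>();   // stands in for the RAM directory
    private long bufferedBytes = 0;
    private long bytesOnDisk = 0;                               // stands in for the on-disk index
    private int flushCount = 0;

    ByteBudgetWriter(long maxBufferedBytes) {
        if (maxBufferedBytes < 1) {
            throw new IllegalArgumentException("maxBufferedBytes must be >= 1");
        }
        this.maxBufferedBytes = maxBufferedBytes;
    }

    /** Buffers the document and flushes once the buffered size reaches the budget. */
    void addDocument(String doc) {
        byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
        ramBuffer.add(bytes);
        bufferedBytes += bytes.length;
        if (bufferedBytes >= maxBufferedBytes) {
            flush();
        }
    }

    /** Flushes whatever is still buffered. */
    void close() {
        flush();
    }

    private void flush() {
        if (ramBuffer.isEmpty()) {
            return; // nothing buffered, so this does not count as a flush
        }
        bytesOnDisk += bufferedBytes;
        ramBuffer.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    long bytesOnDisk() { return bytesOnDisk; }

    int flushCount() { return flushCount; }

    public static void main(String[] args) {
        // Dedicate 10MB to buffering, regardless of document size.
        ByteBudgetWriter writer = new ByteBudgetWriter(10L * 1024 * 1024);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1024; i++) {
            sb.append('x'); // a 1KB ASCII document
        }
        String doc = sb.toString();
        for (int i = 0; i < 25000; i++) {
            writer.addDocument(doc);
        }
        writer.close();
        System.out.println(writer.bytesOnDisk() + " bytes written in "
                + writer.flushCount() + " flushes");
    }
}
```

With a byte budget, the number of documents per flush adapts to document size,
so the same setting works for large and small documents alike.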

Doug

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Strange behavior indexing 1000 documents in RAMDirectory
I was wondering if there is more documentation on RAMDirectory?
There is essentially no documentation in the javadoc for the Directory
classes or on how to apply them effectively in an application.

Does it make sense to index a large collection of documents using a
temporary RAMDirectory and then merge it into the primary index directory?
Does anyone have code examples they can share to illustrate? I've only been
developing with Lucene for a little while now...

thanks,
jeff


----- Original Message -----
From: "Paul Friedman" <pfriedman@macromedia.com>
To: <lucene-user@jakarta.apache.org>
Sent: Friday, November 09, 2001 5:26 PM
Subject: Strange behavior indexing 1000 documents in RAMDirectory




RE: Strange behavior indexing 1000 documents in RAMDirectory
Doug,

How stable are the nightly builds? When do you expect a final release of v.1.2?

Thanks.
Paul

-----Original Message-----
From: Doug Cutting [mailto:DCutting@grandcentral.com]
Sent: Friday, November 09, 2001 6:02 PM
To: 'Lucene Users List'
Subject: RE: Strange behavior indexing 1000 documents in RAMDirectory

