
cvs commit: jakarta-lucene/xdocs benchmarks.xml
otis 2002/12/11 22:23:48

Modified: docs benchmarks.html contributions.html demo.html
demo2.html demo3.html demo4.html fileformats.html
gettingstarted.html index.html luceneplan.html
powered.html queryparsersyntax.html resources.html
todo.html whoweare.html
docs/lucene-sandbox index.html
docs/lucene-sandbox/indyo tutorial.html
docs/lucene-sandbox/larm overview.html
xdocs benchmarks.xml
Log:
- Modified docs.

Revision Changes Path
1.3 +324 -248 jakarta-lucene/docs/benchmarks.html

Index: benchmarks.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/benchmarks.html,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- benchmarks.html 4 Dec 2002 05:56:32 -0000 1.2
+++ benchmarks.html 12 Dec 2002 06:23:47 -0000 1.3
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>
@@ -121,20 +122,20 @@
<tr><td>
<blockquote>
<p>
- The purpose of these user-submitted performance figures is to
-give current and potential users of Lucene a sense
- of how well Lucene scales. If the requirements for an upcoming
-project are similar to an existing benchmark, you
- will also have something to work with when designing the system
-architecture for the application.
- </p>
+ The purpose of these user-submitted performance figures is to
+ give current and potential users of Lucene a sense
+ of how well Lucene scales. If the requirements for an upcoming
+ project are similar to an existing benchmark, you
+ will also have something to work with when designing the system
+ architecture for the application.
+ </p>
<p>
- If you've conducted performance tests with Lucene, we'd
-appreciate it if you could submit these figures for display
- on this page. Post these figures to the lucene-user mailing list
-using this
- <a href="benchmarktemplate.xml">template</a>.
- </p>
+ If you've conducted performance tests with Lucene, we'd
+ appreciate it if you could submit these figures for display
+ on this page. Post these figures to the lucene-user mailing list
+ using this
+ <a href="benchmarktemplate.xml">template</a>.
+ </p>
</blockquote>
</p>
</td></tr>
@@ -149,64 +150,64 @@
<tr><td>
<blockquote>
<p>
- <ul>
- <p>
- <b>Hardware Environment</b><br />
- <li><i>Dedicated machine for indexing</i>: Self-explanatory
-(yes/no)</li>
- <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
- <li><i>RAM</i>: Self-explanatory</li>
- <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
-RAID-1, RAID-5)</li>
- </p>
- <p>
- <b>Software environment</b><br />
- <li><i>Java Version</i>: Version of Java SDK/JRE that is run
-</li>
- <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
- <li><i>OS Version</i>: Self-explanatory</li>
- <li><i>Location of index</i>: Is the index stored in filesystem
-or database? Is it on the same server(local) or
- over the network?</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br />
- <li><i>Number of source documents</i>: Number of documents being
-indexed</li>
- <li><i>Total filesize of source documents</i>:
-Self-explanatory</li>
- <li><i>Average filesize of source documents</i>:
-Self-explanatory</li>
- <li><i>Source documents storage location</i>: Where are the
-documents being indexed located?
- Filesystem, DB, http,etc</li>
- <li><i>File type of source documents</i>: Types of files being
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
- <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
-various files for indexing,
- e.g. XML parser, HTML parser, etc.</li>
- <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
- <li><i>Number of fields per document</i>: Number of Fields each
-Document contains</li>
- <li><i>Type of fields</i>: Type of each field</li>
- <li><i>Index persistence</i>: Where the index is stored, e.g.
-FSDirectory, SqlDirectory, etc</li>
- </p>
- <p>
- <b>Figures</b><br />
- <li><i>Time taken (in ms/s as an average of at least 3 indexing
-runs)</i>: Time taken to index all files</li>
- <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
-1000 files</li>
- <li><i>Memory consumption</i>: Self-explanatory</li>
- </p>
- <p>
- <b>Notes</b><br />
- <li><i>Notes</i>: Any comments which don't belong in the above,
-special tuning/strategies, etc</li>
- </p>
- </ul>
- </p>
+ <ul>
+ <p>
+ <b>Hardware Environment</b><br />
+ <li><i>Dedicated machine for indexing</i>: Self-explanatory
+ (yes/no)</li>
+ <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
+ <li><i>RAM</i>: Self-explanatory</li>
+ <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
+ RAID-1, RAID-5)</li>
+ </p>
+ <p>
+ <b>Software environment</b><br />
+ <li><i>Java Version</i>: Version of Java SDK/JRE that is run
+ </li>
+ <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
+ <li><i>OS Version</i>: Self-explanatory</li>
+ <li><i>Location of index</i>: Is the index stored in filesystem
+ or database? Is it on the same server(local) or
+ over the network?</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br />
+ <li><i>Number of source documents</i>: Number of documents being
+ indexed</li>
+ <li><i>Total filesize of source documents</i>:
+ Self-explanatory</li>
+ <li><i>Average filesize of source documents</i>:
+ Self-explanatory</li>
+ <li><i>Source documents storage location</i>: Where are the
+ documents being indexed located?
+ Filesystem, DB, http,etc</li>
+ <li><i>File type of source documents</i>: Types of files being
+ indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+ <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
+ various files for indexing,
+ e.g. XML parser, HTML parser, etc.</li>
+ <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
+ <li><i>Number of fields per document</i>: Number of Fields each
+ Document contains</li>
+ <li><i>Type of fields</i>: Type of each field</li>
+ <li><i>Index persistence</i>: Where the index is stored, e.g.
+ FSDirectory, SqlDirectory, etc</li>
+ </p>
+ <p>
+ <b>Figures</b><br />
+ <li><i>Time taken (in ms/s as an average of at least 3 indexing
+ runs)</i>: Time taken to index all files</li>
+ <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
+ 1000 files</li>
+ <li><i>Memory consumption</i>: Self-explanatory</li>
+ </p>
+ <p>
+ <b>Notes</b><br />
+ <li><i>Notes</i>: Any comments which don't belong in the above,
+ special tuning/strategies, etc</li>
+ </p>
+ </ul>
+ </p>
</blockquote>
</p>
</td></tr>
@@ -221,17 +222,17 @@
<tr><td>
<blockquote>
<p>
- These benchmarks have been kindly submitted by Lucene users for
-reference purposes.
- </p>
- <p><b>We make NO guarantees regarding their accuracy or
-validity.</b>
- </p>
- <p>We strongly recommend you conduct your own
- performance benchmarks before deciding on a particular
-hardware/software setup (and hopefully submit
- these figures to us).
- </p>
+ These benchmarks have been kindly submitted by Lucene users for
+ reference purposes.
+ </p>
+ <p><b>We make NO guarantees regarding their accuracy or
+ validity.</b>
+ </p>
+ <p>We strongly recommend you conduct your own
+ performance benchmarks before deciding on a particular
+ hardware/software setup (and hopefully submit
+ these figures to us).
+ </p>
<table border="0" cellspacing="0" cellpadding="2" width="100%">
<tr><td bgcolor="#828DA6">
<font color="#ffffff" face="arial,helvetica,sanserif">
@@ -241,109 +242,109 @@
<tr><td>
<blockquote>
<ul>
- <p>
- <b>Hardware Environment</b><br />
- <li><i>Dedicated machine for indexing</i>: yes</li>
- <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
- <li><i>RAM</i>: 512 DDR</li>
- <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
- </p>
- <p>
- <b>Software environment</b><br />
- <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
- <li><i>Java VM</i>: </li>
- <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
- <li><i>Location of index</i>: local</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br />
- <li><i>Number of source documents</i>: Random generator. Set
-to make 1M documents
-in 2x500,000 batches.</li>
- <li><i>Total filesize of source documents</i>: &gt; 1GB if
-stored</li>
- <li><i>Average filesize of source documents</i>: 1KB</li>
- <li><i>Source documents storage location</i>: Filesystem</li>
- <li><i>File type of source documents</i>: Generated</li>
- <li><i>Parser(s) used, if any</i>: </li>
- <li><i>Analyzer(s) used</i>: Default</li>
- <li><i>Number of fields per document</i>: 11</li>
- <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
- <li><i>Index persistence</i>: FSDirectory</li>
- </p>
- <p>
- <b>Figures</b><br />
- <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: </li>
- <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
- <li><i>Memory consumption</i>:</li>
- </p>
- <p>
- <b>Notes</b><br />
- <li><i>Notes</i>:
- <p>
- A windows client ran a random document generator which
-created
- documents based on some arrays of values and an excerpt
-(approx 1kb)
- from a text file of the bible (King James version).<br />
- These were submitted via a socket connection (open throughout
- indexing process).<br />
- The index writer was not closed between index calls.<br />
- This created a 400Mb index in 23 files (after
-optimization).<br />
- </p>
- <p>
- <u>Query details</u>:<br />
- </p>
- <p>
- Set up a threaded class to start x number of simultaneous
-threads to
- search the above created index.
- </p>
- <p>
- Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
-(Teaser:goo*
- Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
- +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
- +EndDate:[mq3dj1uq0-ntlxuggw0]
- </p>
- <p>
- This query counted 34000 documents and I limited the returned
-documents
- to 5.
- </p>
- <p>
- This is using Peter Halacsy's IndexSearcherCache slightly
-modified to
- be a singleton that returns cached searchers for a given
-directory. This
- solved an initial problem with too many files open and
-running out of
- Linux handles for them.
- </p>
- <pre>
- Threads|Avg Time per query (ms)
- 1 1009ms
- 2 2043ms
- 3 3087ms
- 4 4045ms
- .. .
- .. .
- 10 10091ms
- </pre>
- <p>
- I removed the two date range terms from the query and it made
-a HUGE
- difference in performance. With 4 threads the avg time
-dropped to 900ms!
- </p>
- <p>Other query optimizations made little difference.</p></li>
- </p>
- </ul>
+ <p>
+ <b>Hardware Environment</b><br />
+ <li><i>Dedicated machine for indexing</i>: yes</li>
+ <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
+ <li><i>RAM</i>: 512 DDR</li>
+ <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
+ </p>
+ <p>
+ <b>Software environment</b><br />
+ <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
+ <li><i>Java VM</i>: </li>
+ <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
+ <li><i>Location of index</i>: local</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br />
+ <li><i>Number of source documents</i>: Random generator. Set
+ to make 1M documents
+ in 2x500,000 batches.</li>
+ <li><i>Total filesize of source documents</i>: &gt; 1GB if
+ stored</li>
+ <li><i>Average filesize of source documents</i>: 1KB</li>
+ <li><i>Source documents storage location</i>: Filesystem</li>
+ <li><i>File type of source documents</i>: Generated</li>
+ <li><i>Parser(s) used, if any</i>: </li>
+ <li><i>Analyzer(s) used</i>: Default</li>
+ <li><i>Number of fields per document</i>: 11</li>
+ <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
+ <li><i>Index persistence</i>: FSDirectory</li>
+ </p>
+ <p>
+ <b>Figures</b><br />
+ <li><i>Time taken (in ms/s as an average of at least 3
+ indexing runs)</i>: </li>
+ <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
+ <li><i>Memory consumption</i>:</li>
+ </p>
+ <p>
+ <b>Notes</b><br />
+ <li><i>Notes</i>:
+ <p>
+ A windows client ran a random document generator which
+ created
+ documents based on some arrays of values and an excerpt
+ (approx 1kb)
+ from a text file of the bible (King James version).<br />
+ These were submitted via a socket connection (open throughout
+ indexing process).<br />
+ The index writer was not closed between index calls.<br />
+ This created a 400Mb index in 23 files (after
+ optimization).<br />
+ </p>
+ <p>
+ <u>Query details</u>:<br />
+ </p>
+ <p>
+ Set up a threaded class to start x number of simultaneous
+ threads to
+ search the above created index.
+ </p>
+ <p>
+ Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
+ (Teaser:goo*
+ Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
+ +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
+ +EndDate:[mq3dj1uq0-ntlxuggw0]
+ </p>
+ <p>
+ This query counted 34000 documents and I limited the returned
+ documents
+ to 5.
+ </p>
+ <p>
+ This is using Peter Halacsy's IndexSearcherCache slightly
+ modified to
+ be a singleton that returns cached searchers for a given
+ directory. This
+ solved an initial problem with too many files open and
+ running out of
+ Linux handles for them.
+ </p>
+ <pre>
+ Threads|Avg Time per query (ms)
+ 1 1009ms
+ 2 2043ms
+ 3 3087ms
+ 4 4045ms
+ .. .
+ .. .
+ 10 10091ms
+ </pre>
+ <p>
+ I removed the two date range terms from the query and it made
+ a HUGE
+ difference in performance. With 4 threads the avg time
+ dropped to 900ms!
+ </p>
+ <p>Other query optimizations made little difference.</p></li>
+ </p>
+ </ul>
<p>
- Hamish can be contacted at hamish at catalyst.net.nz.
- </p>
+ Hamish can be contacted at hamish at catalyst.net.nz.
+ </p>
</blockquote>
</td></tr>
<tr><td><br/></td></tr>
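One way to read the thread table in Hamish's notes is to convert each row into aggregate throughput: the number of threads divided by the average time per query. The sketch below does only that arithmetic on the reported figures. It assumes every thread issues its queries back to back, which the notes do not state, and the names `reported_ms` and `throughput_qps` are illustrative. Nothing here touches Lucene itself.

```python
# Reported figures from the thread table: threads -> average time per query (ms).
reported_ms = {1: 1009, 2: 2043, 3: 3087, 4: 4045, 10: 10091}


def throughput_qps(threads, avg_ms):
    """Aggregate queries per second if `threads` queries complete every `avg_ms`.

    Assumes each thread runs its queries back to back (an assumption, not
    something stated in the benchmark notes).
    """
    return threads / (avg_ms / 1000.0)


throughputs = {t: throughput_qps(t, ms) for t, ms in reported_ms.items()}

for threads, qps in sorted(throughputs.items()):
    print(f"{threads:>2} threads: {qps:.2f} queries/s")

# The notes also report about 900 ms per query at 4 threads once the two
# date range terms were removed from the query.
without_ranges_qps = throughput_qps(4, 900)
print(f" 4 threads, no date ranges: {without_ranges_qps:.2f} queries/s")
```

Under that assumption every row of the original table comes out just under one query per second, so extra threads did not raise overall throughput for this query on this machine, while dropping the range terms gives roughly 4.4 queries per second at 4 threads. That reading is an inference from the table, not a claim made by the submitter.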
@@ -357,71 +358,146 @@
<tr><td>
<blockquote>
<ul>
- <p>
- <b>Hardware Environment</b><br />
- <li><i>Dedicated machine for indexing</i>: No, but nominal
-usage at time of indexing.</li>
- <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
- <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
- <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
-Array</li>
- </p>
- <p>
- <b>Software environment</b><br />
- <li><i>Java Version</i>: 1.3.1_06</li>
- <li><i>Java VM</i>: </li>
- <li><i>OS Version</i>: Winnt 4/Sp6</li>
- <li><i>Location of index</i>: local</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br />
- <li><i>Number of source documents</i>: about 60K</li>
- <li><i>Total filesize of source documents</i>: 6.5GB</li>
- <li><i>Average filesize of source documents</i>: 100K
-(6.5GB/60K documents)</li>
- <li><i>Source documents storage location</i>: filesystem on
-NTFS</li>
- <li><i>File type of source documents</i>: </li>
- <li><i>Parser(s) used, if any</i>: Currently the only parser
-used is the Quiotix html
- parser.</li>
- <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
- <li><i>Number of fields per document</i>: 8</li>
- <li><i>Type of fields</i>: All strings, and all are stored
-and indexed.</li>
- <li><i>Index persistence</i>: FSDirectory</li>
- </p>
- <p>
- <b>Figures</b><br />
- <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
-minutes. Note that the #
- and size of documents changes daily.</li>
- <li><i>Time taken / 1000 docs indexed</i>: </li>
- <li><i>Memory consumption</i>: JVM is given 256MB and uses it
-all.</li>
- </p>
- <p>
- <b>Notes</b><br />
- <li><i>Notes</i>:
- <p>
- We have 10 threads reading files from the filesystem and
-parsing and
- analyzing them and then pushing them onto a queue, and a single
-thread popping
- them from the queue and indexing. Note that we are indexing
-email messages
- and are storing the entire plaintext of the message in the
-index. If the
- message contains an attachment and we do not have a filter for
-the attachment
- (i.e., we do not do PDFs yet), we discard the data.
- </p></li>
- </p>
- </ul>
+ <p>
+ <b>Hardware Environment</b><br />
+ <li><i>Dedicated machine for indexing</i>: No, but nominal
+ usage at time of indexing.</li>
+ <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
+ <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
+ <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
+ Array</li>
+ </p>
+ <p>
+ <b>Software environment</b><br />
+ <li><i>Java Version</i>: 1.3.1_06</li>
+ <li><i>Java VM</i>: </li>
+ <li><i>OS Version</i>: Winnt 4/Sp6</li>
+ <li><i>Location of index</i>: local</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br />
+ <li><i>Number of source documents</i>: about 60K</li>
+ <li><i>Total filesize of source documents</i>: 6.5GB</li>
+ <li><i>Average filesize of source documents</i>: 100K
+ (6.5GB/60K documents)</li>
+ <li><i>Source documents storage location</i>: filesystem on
+ NTFS</li>
+ <li><i>File type of source documents</i>: </li>
+ <li><i>Parser(s) used, if any</i>: Currently the only parser
+ used is the Quiotix html
+ parser.</li>
+ <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
+ <li><i>Number of fields per document</i>: 8</li>
+ <li><i>Type of fields</i>: All strings, and all are stored
+ and indexed.</li>
+ <li><i>Index persistence</i>: FSDirectory</li>
+ </p>
+ <p>
+ <b>Figures</b><br />
+ <li><i>Time taken (in ms/s as an average of at least 3
+ indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
+ minutes. Note that the #
+ and size of documents changes daily.</li>
+ <li><i>Time taken / 1000 docs indexed</i>: </li>
+ <li><i>Memory consumption</i>: JVM is given 256MB and uses it
+ all.</li>
+ </p>
+ <p>
+ <b>Notes</b><br />
+ <li><i>Notes</i>:
+ <p>
+ We have 10 threads reading files from the filesystem and
+ parsing and
+ analyzing them and then pushing them onto a queue, and a single
+ thread popping
+ them from the queue and indexing. Note that we are indexing
+ email messages
+ and are storing the entire plaintext of the message in the
+ index. If the
+ message contains an attachment and we do not have a filter for
+ the attachment
+ (i.e., we do not do PDFs yet), we discard the data.
+ </p></li>
+ </p>
+ </ul>
+ <p>
+ Justin can be contacted at tvxh-lw4x at spamex.com.
+ </p>
+ </blockquote>
+ </td></tr>
+ <tr><td><br/></td></tr>
+ </table>
+ <table border="0" cellspacing="0" cellpadding="2" width="100%">
+ <tr><td bgcolor="#828DA6">
+ <font color="#ffffff" face="arial,helvetica,sanserif">
+ <a name="Daniel Armbrust's benchmarks"><strong>Daniel Armbrust's benchmarks</strong></a>
+ </font>
+ </td></tr>
+ <tr><td>
+ <blockquote>
+ <p>
+ My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed,
+ nor was the total index built in one shot. The index was created on several different
+ machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
+ 1 million documents per batch. Each of these small indexes was then moved to a
+ much larger drive, where they were all merged together into a big index.
+ This process was done manually, over the course of several months, as the sources became available.
+ </p>
+ <ul>
+ <p>
+ <b>Hardware Environment</b><br />
+ <li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load. However, the indexing process was built single
+ threaded, so it only took advantage of 1 of the processors. It usually got 100% of this processor.</li>
+ <li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>
+ <li><i>RAM</i>: 4 GB Memory</li>
+ <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
+ </p>
+ <p>
+ <b>Software environment</b><br />
+ <li><i>Java Version</i>: 1.3.1</li>
+ <li><i>Java VM</i>: </li>
+ <li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
+ <li><i>Location of index</i>: local</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br />
+ <li><i>Number of source documents</i>: 13,820,517</li>
+ <li><i>Total filesize of source documents</i>: 87.3 GB</li>
+ <li><i>Average filesize of source documents</i>: 6.3 KB</li>
+ <li><i>Source documents storage location</i>: Filesystem</li>
+ <li><i>File type of source documents</i>: XML</li>
+ <li><i>Parser(s) used, if any</i>: </li>
+ <li><i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li>
+ <li><i>Number of fields per document</i>: 1 - 31</li>
+ <li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>
+ <li><i>Index persistence</i>: FSDirectory</li>
+ <li><i>Index size</i>: 12.5 GB</li>
+ </p>
+ <p>
+ <b>Figures</b><br />
+ <li><i>Time taken (in ms/s as an average of at least 3
+ indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li>
+ <li><i>Time taken / 1000 docs indexed</i>: 340 Seconds</li>
+ <li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
+ 1 GB of memory was allotted to the indexer</li>
+ </p>
+ <p>
+ <b>Notes</b><br />
+ <li><i>Notes</i>:
+ <p>
+ The source documents were XML. The "indexer" opened each document one at a time, ran an
+ XSL transformation on them, and then proceeded to index the stream. The indexer optimized
+ the index every 50,000 documents (on this run) though previously, we optimized every
+ 300,000 documents. The performance didn't change much either way. We did no other
+ tuning (RAM Directories, separate process to pretransform the source material, etc)
+ to make it index faster. When all of these individual indexes were built, they were
+ merged together into the main index. That process usually took ~ a day.
+ </p></li>
+ </p>
+ </ul>
<p>
- Justin can be contacted at tvxh-lw4x at spamex.com.
- </p>
+ Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+ </p>
</blockquote>
</td></tr>
<tr><td><br/></td></tr>
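The derived figures in Daniel Armbrust's entry can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the entry. Reading "87.3 GB" as decimal gigabytes and "KB" as 1000 bytes is an assumption, and the variable names are illustrative.

```python
# Figures reported for the timed run in Daniel Armbrust's entry.
docs_indexed = 617_271
elapsed_seconds = 209_698

# Figures reported for the whole collection.
total_docs = 13_820_517
total_bytes = 87.3e9  # 87.3 GB, read here as decimal gigabytes (an assumption)

# Time per 1000 documents, elapsed time in days, and average document size.
seconds_per_1000_docs = elapsed_seconds / docs_indexed * 1000
elapsed_days = elapsed_seconds / 86_400
average_doc_kb = total_bytes / total_docs / 1000  # KB taken as 1000 bytes

print(f"{seconds_per_1000_docs:.0f} s per 1000 docs")
print(f"{elapsed_days:.1f} days elapsed")
print(f"{average_doc_kb:.1f} KB per document")
```

With these inputs the rate rounds to the reported 340 seconds per 1000 documents and the average size to the reported 6.3 KB. The elapsed time works out to about 2.4 days, close to the "~2.5 days" given in the entry.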

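The threading layout described in Justin's notes (several reader threads parsing files and pushing them onto a queue, with a single thread popping from the queue and indexing) can be sketched as below. This is a generic illustration of that pattern, not the submitter's code: `parse_file` and `index_document` are hypothetical stand-ins for the HTML parser and the call that adds a document to the index, the file names are made up, and the queue bound of 100 is an arbitrary choice.

```python
import queue
import threading

NUM_READERS = 10   # matches the 10 reader threads mentioned in the notes
SENTINEL = None    # tells the indexing thread that no more documents will arrive


def parse_file(path):
    """Hypothetical stand-in for reading, parsing and analyzing one file."""
    return {"path": path, "body": f"contents of {path}"}


def index_document(doc, index):
    """Hypothetical stand-in for adding one parsed document to the index."""
    index.append(doc["path"])


def reader(paths, work_queue):
    # Take paths until the shared input queue is empty, then exit.
    while True:
        try:
            path = paths.get_nowait()
        except queue.Empty:
            return
        work_queue.put(parse_file(path))


def indexer(work_queue, index):
    # Single consumer: only this thread ever writes to the index.
    while True:
        doc = work_queue.get()
        if doc is SENTINEL:
            return
        index_document(doc, index)


def run(file_paths):
    paths = queue.Queue()
    for p in file_paths:
        paths.put(p)
    work_queue = queue.Queue(maxsize=100)  # bounded so readers cannot run far ahead
    index = []
    consumer = threading.Thread(target=indexer, args=(work_queue, index))
    consumer.start()
    readers = [threading.Thread(target=reader, args=(paths, work_queue))
               for _ in range(NUM_READERS)]
    for t in readers:
        t.start()
    for t in readers:
        t.join()
    work_queue.put(SENTINEL)  # every reader has finished, so this is the last item
    consumer.join()
    return index


indexed = run([f"msg-{i}.html" for i in range(50)])
print(len(indexed))
```

Keeping all index writes on one thread mirrors the single indexing thread in the notes. The bounded queue is a design choice of this sketch: it applies back-pressure so that parsed documents do not pile up in memory when parsing outpaces indexing.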


1.17 +1 -0 jakarta-lucene/docs/contributions.html

Index: contributions.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/contributions.html,v
retrieving revision 1.16
retrieving revision 1.17
diff -u -r1.16 -r1.17
--- contributions.html 4 Dec 2002 05:56:32 -0000 1.16
+++ contributions.html 12 Dec 2002 06:23:47 -0000 1.17
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.13 +1 -0 jakarta-lucene/docs/demo.html

Index: demo.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- demo.html 4 Dec 2002 05:56:32 -0000 1.12
+++ demo.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.13 +1 -0 jakarta-lucene/docs/demo2.html

Index: demo2.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo2.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- demo2.html 4 Dec 2002 05:56:32 -0000 1.12
+++ demo2.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.15 +1 -0 jakarta-lucene/docs/demo3.html

Index: demo3.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo3.html,v
retrieving revision 1.14
retrieving revision 1.15
diff -u -r1.14 -r1.15
--- demo3.html 4 Dec 2002 05:56:32 -0000 1.14
+++ demo3.html 12 Dec 2002 06:23:47 -0000 1.15
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.13 +1 -0 jakarta-lucene/docs/demo4.html

Index: demo4.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo4.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- demo4.html 4 Dec 2002 05:56:32 -0000 1.12
+++ demo4.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.6 +1 -0 jakarta-lucene/docs/fileformats.html

Index: fileformats.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/fileformats.html,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -r1.5 -r1.6
--- fileformats.html 4 Dec 2002 05:56:32 -0000 1.5
+++ fileformats.html 12 Dec 2002 06:23:47 -0000 1.6
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.13 +1 -0 jakarta-lucene/docs/gettingstarted.html

Index: gettingstarted.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/gettingstarted.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- gettingstarted.html 4 Dec 2002 05:56:32 -0000 1.12
+++ gettingstarted.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.24 +1 -0 jakarta-lucene/docs/index.html

Index: index.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/index.html,v
retrieving revision 1.23
retrieving revision 1.24
diff -u -r1.23 -r1.24
--- index.html 4 Dec 2002 05:56:32 -0000 1.23
+++ index.html 12 Dec 2002 06:23:47 -0000 1.24
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.14 +1 -0 jakarta-lucene/docs/luceneplan.html

Index: luceneplan.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/luceneplan.html,v
retrieving revision 1.13
retrieving revision 1.14
diff -u -r1.13 -r1.14
--- luceneplan.html 4 Dec 2002 05:56:32 -0000 1.13
+++ luceneplan.html 12 Dec 2002 06:23:47 -0000 1.14
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.22 +1 -0 jakarta-lucene/docs/powered.html

Index: powered.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/powered.html,v
retrieving revision 1.21
retrieving revision 1.22
diff -u -r1.21 -r1.22
--- powered.html 4 Dec 2002 05:56:32 -0000 1.21
+++ powered.html 12 Dec 2002 06:23:47 -0000 1.22
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.12 +1 -0 jakarta-lucene/docs/queryparsersyntax.html

Index: queryparsersyntax.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/queryparsersyntax.html,v
retrieving revision 1.11
retrieving revision 1.12
diff -u -r1.11 -r1.12
--- queryparsersyntax.html 4 Dec 2002 05:56:32 -0000 1.11
+++ queryparsersyntax.html 12 Dec 2002 06:23:47 -0000 1.12
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.20 +1 -0 jakarta-lucene/docs/resources.html

Index: resources.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/resources.html,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -r1.19 -r1.20
--- resources.html 4 Dec 2002 05:56:32 -0000 1.19
+++ resources.html 12 Dec 2002 06:23:47 -0000 1.20
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.4 +1 -0 jakarta-lucene/docs/todo.html

Index: todo.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/todo.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- todo.html 4 Dec 2002 05:56:32 -0000 1.3
+++ todo.html 12 Dec 2002 06:23:47 -0000 1.4
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.20 +1 -0 jakarta-lucene/docs/whoweare.html

Index: whoweare.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/whoweare.html,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -r1.19 -r1.20
--- whoweare.html 4 Dec 2002 05:56:32 -0000 1.19
+++ whoweare.html 12 Dec 2002 06:23:47 -0000 1.20
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.8 +1 -0 jakarta-lucene/docs/lucene-sandbox/index.html

Index: index.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/index.html,v
retrieving revision 1.7
retrieving revision 1.8
diff -u -r1.7 -r1.8
--- index.html 4 Dec 2002 05:56:33 -0000 1.7
+++ index.html 12 Dec 2002 06:23:48 -0000 1.8
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.7 +1 -0 jakarta-lucene/docs/lucene-sandbox/indyo/tutorial.html

Index: tutorial.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/indyo/tutorial.html,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -r1.6 -r1.7
--- tutorial.html 4 Dec 2002 05:56:33 -0000 1.6
+++ tutorial.html 12 Dec 2002 06:23:48 -0000 1.7
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.6 +1 -0 jakarta-lucene/docs/lucene-sandbox/larm/overview.html

Index: overview.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/larm/overview.html,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -r1.5 -r1.6
--- overview.html 4 Dec 2002 05:56:33 -0000 1.5
+++ overview.html 12 Dec 2002 06:23:48 -0000 1.6
@@ -5,6 +5,7 @@

<!-- start the processing -->
<!-- ====================================================================== -->
+ <!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
<!-- Main Page Section -->
<!-- ====================================================================== -->
<html>



1.2 +337 -271 jakarta-lucene/xdocs/benchmarks.xml

Index: benchmarks.xml
===================================================================
RCS file: /home/cvs/jakarta-lucene/xdocs/benchmarks.xml,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- benchmarks.xml 4 Dec 2002 05:46:43 -0000 1.1
+++ benchmarks.xml 12 Dec 2002 06:23:48 -0000 1.2
@@ -1,283 +1,349 @@
<?xml version="1.0"?>
<document>
<properties>
- <author email="kelvint@apache.org">Kelvin Tan</author>
- <title>Resources - Performance Benchmarks</title>
+ <author email="kelvint@apache.org">Kelvin Tan</author>
+ <title>Resources - Performance Benchmarks</title>
</properties>
<body>

- <section name="Performance Benchmarks">
- <p>
- The purpose of these user-submitted performance figures is to
-give current and potential users of Lucene a sense
- of how well Lucene scales. If the requirements for an upcoming
-project is similar to an existing benchmark, you
- will also have something to work with when designing the system
-architecture for the application.
- </p>
- <p>
- If you've conducted performance tests with Lucene, we'd
-appreciate if you can submit these figures for display
- on this page. Post these figures to the lucene-user mailing list
-using this
- <a href="benchmarktemplate.xml">template</a>.
- </p>
- </section>
-
- <section name="Benchmark Variables">
- <p>
- <ul>
- <p>
- <b>Hardware Environment</b><br/>
- <li><i>Dedicated machine for indexing</i>: Self-explanatory
-(yes/no)</li>
- <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
- <li><i>RAM</i>: Self-explanatory</li>
- <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
-RAID-1, RAID-5)</li>
- </p>
- <p>
- <b>Software environment</b><br/>
- <li><i>Java Version</i>: Version of Java SDK/JRE that is run
-</li>
- <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
- <li><i>OS Version</i>: Self-explanatory</li>
- <li><i>Location of index</i>: Is the index stored in filesystem
-or database? Is it on the same server(local) or
- over the network?</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br/>
- <li><i>Number of source documents</i>: Number of documents being
-indexed</li>
- <li><i>Total filesize of source documents</i>:
-Self-explanatory</li>
- <li><i>Average filesize of source documents</i>:
-Self-explanatory</li>
- <li><i>Source documents storage location</i>: Where are the
-documents being indexed located?
- Filesystem, DB, http,etc</li>
- <li><i>File type of source documents</i>: Types of files being
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
- <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
-various files for indexing,
- e.g. XML parser, HTML parser, etc.</li>
- <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
- <li><i>Number of fields per document</i>: Number of Fields each
-Document contains</li>
- <li><i>Type of fields</i>: Type of each field</li>
- <li><i>Index persistence</i>: Where the index is stored, e.g.
-FSDirectory, SqlDirectory, etc</li>
- </p>
- <p>
- <b>Figures</b><br/>
- <li><i>Time taken (in ms/s as an average of at least 3 indexing
-runs)</i>: Time taken to index all files</li>
- <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
-1000 files</li>
- <li><i>Memory consumption</i>: Self-explanatory</li>
- </p>
- <p>
- <b>Notes</b><br/>
- <li><i>Notes</i>: Any comments which don't belong in the above,
-special tuning/strategies, etc</li>
- </p>
- </ul>
- </p>
- </section>
+ <section name="Performance Benchmarks">
+ <p>
+ The purpose of these user-submitted performance figures is to
+ give current and potential users of Lucene a sense
+ of how well Lucene scales. If the requirements for an upcoming
+ project are similar to an existing benchmark, you
+ will also have something to work with when designing the system
+ architecture for the application.
+ </p>
+ <p>
+ If you've conducted performance tests with Lucene, we'd
+ appreciate it if you could submit these figures for display
+ on this page. Post these figures to the lucene-user mailing list
+ using this
+ <a href="benchmarktemplate.xml">template</a>.
+ </p>
+ </section>

- <section name="User-submitted Benchmarks">
- <p>
- These benchmarks have been kindly submitted by Lucene users for
-reference purposes.
- </p>
- <p><b>We make NO guarantees regarding their accuracy or
-validity.</b>
- </p>
- <p>We strongly recommend you conduct your own
- performance benchmarks before deciding on a particular
-hardware/software setup (and hopefully submit
- these figures to us).
- </p>
-
- <subsection name="Hamish Carpenter's benchmarks">
- <ul>
- <p>
- <b>Hardware Environment</b><br/>
- <li><i>Dedicated machine for indexing</i>: yes</li>
- <li><i>CPU</i>: Intel x86 P4 1.5Ghz</li>
- <li><i>RAM</i>: 512 DDR</li>
- <li><i>Drive configuration</i>: IDE 7200rpm Raid-1</li>
- </p>
- <p>
- <b>Software environment</b><br/>
- <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
- <li><i>Java VM</i>: </li>
- <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
- <li><i>Location of index</i>: local</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br/>
- <li><i>Number of source documents</i>: Random generator. Set
-to make 1M documents
-in 2x500,000 batches.</li>
- <li><i>Total filesize of source documents</i>: > 1GB if
-stored</li>
- <li><i>Average filesize of source documents</i>: 1KB</li>
- <li><i>Source documents storage location</i>: Filesystem</li>
- <li><i>File type of source documents</i>: Generated</li>
- <li><i>Parser(s) used, if any</i>: </li>
- <li><i>Analyzer(s) used</i>: Default</li>
- <li><i>Number of fields per document</i>: 11</li>
- <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
- <li><i>Index persistence</i>: FSDirectory</li>
- </p>
- <p>
- <b>Figures</b><br/>
- <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: </li>
- <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
- <li><i>Memory consumption</i>:</li>
- </p>
- <p>
- <b>Notes</b><br/>
- <li><i>Notes</i>:
- <p>
- A windows client ran a random document generator which
-created
- documents based on some arrays of values and an excerpt
-(approx 1kb)
- from a text file of the bible (King James version).<br/>
- These were submitted via a socket connection (open throughout
- indexing process).<br/>
- The index writer was not closed between index calls.<br/>
- This created a 400Mb index in 23 files (after
-optimization).<br/>
- </p>
- <p>
- <u>Query details</u>:<br/>
- </p>
- <p>
- Set up a threaded class to start x number of simultaneous
-threads to
- search the above created index.
- </p>
- <p>
- Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
-(Teaser:goo* Tea
- ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
- +DisplayStartDate:[mkwsw2jk0
- -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
- </p>
- <p>
- This query counted 34000 documents and I limited the returned
-documents
- to 5.
- </p>
- <p>
- This is using Peter Halacsy's IndexSearcherCache slightly
-modified to
- be a singleton returned cached searchers for a given
-directory. This
- solved an initial problem with too many files open and
-running out of
- linux handles for them.
- </p>
- <pre>
- Threads|Avg Time per query (ms)
- 1 1009ms
- 2 2043ms
- 3 3087ms
- 4 4045ms
- .. .
- .. .
- 10 10091ms
- </pre>
- <p>
- I removed the two date range terms from the query and it made
-a HUGE
- difference in performance. With 4 threads the avg time
-dropped to 900ms!
- </p>
- <p>Other query optimizations made little difference.</p></li>
- </p>
- </ul>
- <p>
- Hamish can be contacted at hamish at catalyst.net.nz.
- </p>
- </subsection>
+ <section name="Benchmark Variables">
+ <p>
+ <ul>
+ <p>
+ <b>Hardware Environment</b><br/>
+ <li><i>Dedicated machine for indexing</i>: Self-explanatory
+ (yes/no)</li>
+ <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
+ <li><i>RAM</i>: Self-explanatory</li>
+ <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI,
+ RAID-1, RAID-5)</li>
+ </p>
+ <p>
+ <b>Software environment</b><br/>
+ <li><i>Java Version</i>: Version of Java SDK/JRE that is run
+ </li>
+ <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
+ <li><i>OS Version</i>: Self-explanatory</li>
+ <li><i>Location of index</i>: Is the index stored in the filesystem
+ or a database? Is it on the same server (local) or
+ over the network?</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br/>
+ <li><i>Number of source documents</i>: Number of documents being
+ indexed</li>
+ <li><i>Total filesize of source documents</i>:
+ Self-explanatory</li>
+ <li><i>Average filesize of source documents</i>:
+ Self-explanatory</li>
+ <li><i>Source documents storage location</i>: Where are the
+ documents being indexed located?
+ Filesystem, DB, HTTP, etc.</li>
+ <li><i>File type of source documents</i>: Types of files being
+ indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+ <li><i>Parser(s) used, if any</i>: Parsers used for parsing the
+ various files for indexing,
+ e.g. XML parser, HTML parser, etc.</li>
+ <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
+ <li><i>Number of fields per document</i>: Number of Fields each
+ Document contains</li>
+ <li><i>Type of fields</i>: Type of each field</li>
+ <li><i>Index persistence</i>: Where the index is stored, e.g.
+ FSDirectory, SqlDirectory, etc</li>
+ </p>
+ <p>
+ <b>Figures</b><br/>
+ <li><i>Time taken (in ms/s as an average of at least 3 indexing
+ runs)</i>: Time taken to index all files</li>
+ <li><i>Time taken / 1000 docs indexed</i>: Time taken to index
+ 1000 files</li>
+ <li><i>Memory consumption</i>: Self-explanatory</li>
+ </p>
+ <p>
+ <b>Notes</b><br/>
+ <li><i>Notes</i>: Any comments which don't belong in the above,
+ special tuning/strategies, etc</li>
+ </p>
+ </ul>
+ </p>
+ </section>

- <subsection name="Justin Greene's benchmarks">
- <ul>
- <p>
- <b>Hardware Environment</b><br/>
- <li><i>Dedicated machine for indexing</i>: No, but nominal
-usage at time of indexing.</li>
- <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
- <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
- <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
-Array</li>
- </p>
- <p>
- <b>Software environment</b><br/>
- <li><i>Java Version</i>: 1.3.1_06</li>
- <li><i>Java VM</i>: </li>
- <li><i>OS Version</i>: Winnt 4/Sp6</li>
- <li><i>Location of index</i>: local</li>
- </p>
- <p>
- <b>Lucene indexing variables</b><br/>
- <li><i>Number of source documents</i>: about 60K</li>
- <li><i>Total filesize of source documents</i>: 6.5GB</li>
- <li><i>Average filesize of source documents</i>: 100K
-(6.5GB/60K documents)</li>
- <li><i>Source documents storage location</i>: filesystem on
-NTFS</li>
- <li><i>File type of source documents</i>: </li>
- <li><i>Parser(s) used, if any</i>: Currently the only parser
-used is the Quiotix html
- parser.</li>
- <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
- <li><i>Number of fields per document</i>: 8</li>
- <li><i>Type of fields</i>: All strings, and all are stored
-and indexed.</li>
- <li><i>Index persistence</i>: FSDirectory</li>
- </p>
- <p>
- <b>Figures</b><br/>
- <li><i>Time taken (in ms/s as an average of at least 3
-indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
-minutes. Note that the #
- and size of documents changes daily.</li>
- <li><i>Time taken / 1000 docs indexed</i>: </li>
- <li><i>Memory consumption</i>: JVM is given 256MB and uses it
-all.</li>
- </p>
- <p>
- <b>Notes</b><br/>
- <li><i>Notes</i>:
- <p>
- We have 10 threads reading files from the filesystem and
-parsing and
- analyzing them and the pushing them onto a queue and a single
-thread poping
- them from the queue and indexing. Note that we are indexing
-email messages
- and are storing the entire plaintext in of the message in the
-index. If the
- message contains attachment and we do not have a filter for
-the attachment
- (ie. we do not do PDFs yet), we discard the data.
- </p></li>
- </p>
- </ul>
- <p>
- Justin can be contacted at tvxh-lw4x at spamex.com.
- </p>
- </subsection>
+ <section name="User-submitted Benchmarks">
+ <p>
+ These benchmarks have been kindly submitted by Lucene users for
+ reference purposes.
+ </p>
+ <p><b>We make NO guarantees regarding their accuracy or
+ validity.</b>
+ </p>
+ <p>We strongly recommend you conduct your own
+ performance benchmarks before deciding on a particular
+ hardware/software setup (and hopefully submit
+ these figures to us).
+ </p>

- </section>
+ <subsection name="Hamish Carpenter's benchmarks">
+ <ul>
+ <p>
+ <b>Hardware Environment</b><br/>
+ <li><i>Dedicated machine for indexing</i>: yes</li>
+ <li><i>CPU</i>: Intel x86 P4 1.5GHz</li>
+ <li><i>RAM</i>: 512 DDR</li>
+ <li><i>Drive configuration</i>: IDE 7200rpm RAID-1</li>
+ </p>
+ <p>
+ <b>Software environment</b><br/>
+ <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
+ <li><i>Java VM</i>: </li>
+ <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
+ <li><i>Location of index</i>: local</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br/>
+ <li><i>Number of source documents</i>: Random generator. Set
+ to make 1M documents
+ in 2x500,000 batches.</li>
+ <li><i>Total filesize of source documents</i>: > 1GB if
+ stored</li>
+ <li><i>Average filesize of source documents</i>: 1KB</li>
+ <li><i>Source documents storage location</i>: Filesystem</li>
+ <li><i>File type of source documents</i>: Generated</li>
+ <li><i>Parser(s) used, if any</i>: </li>
+ <li><i>Analyzer(s) used</i>: Default</li>
+ <li><i>Number of fields per document</i>: 11</li>
+ <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
+ <li><i>Index persistence</i>: FSDirectory</li>
+ </p>
+ <p>
+ <b>Figures</b><br/>
+ <li><i>Time taken (in ms/s as an average of at least 3
+ indexing runs)</i>: </li>
+ <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
+ <li><i>Memory consumption</i>:</li>
+ </p>
+ <p>
+ <b>Notes</b><br/>
+ <li><i>Notes</i>:
+ <p>
+ A Windows client ran a random document generator which
+ created
+ documents based on some arrays of values and an excerpt
+ (approx 1KB)
+ from a text file of the Bible (King James version).<br/>
+ These were submitted via a socket connection (open throughout
+ the indexing process).<br/>
+ The index writer was not closed between index calls.<br/>
+ This created a 400MB index in 23 files (after
+ optimization).<br/>
+ </p>
+ <p>
+ <u>Query details</u>:<br/>
+ </p>
+ <p>
+ Set up a threaded class to start x number of simultaneous
+ threads to
+ search the above created index.
+ </p>
+ <p>
+ Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
+ (Teaser:goo* Teaser:plan*)
+ (Details:goo* Details:plan*)) -Cancel:y)
+ +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
+ +EndDate:[mq3dj1uq0-ntlxuggw0]
+ </p>
+ <p>
+ This query counted 34000 documents and I limited the returned
+ documents
+ to 5.
+ </p>
+ <p>
+ This is using Peter Halacsy's IndexSearcherCache slightly
+ modified to
+ be a singleton that returns cached searchers for a given
+ directory. This
+ solved an initial problem with too many files open and
+ running out of
+ Linux handles for them.
+ </p>
+ <pre>
+ Threads|Avg Time per query (ms)
+ 1 1009ms
+ 2 2043ms
+ 3 3087ms
+ 4 4045ms
+ .. .
+ .. .
+ 10 10091ms
+ </pre>
+ <p>
+ I removed the two date range terms from the query and it made
+ a HUGE
+ difference in performance. With 4 threads the avg time
+ dropped to 900ms!
+ </p>
+ <p>Other query optimizations made little difference.</p></li>
+ </p>
+ </ul>
+ <p>
+ Hamish can be contacted at hamish at catalyst.net.nz.
+ </p>
+ </subsection>
+
+ <subsection name="Justin Greene's benchmarks">
+ <ul>
+ <p>
+ <b>Hardware Environment</b><br/>
+ <li><i>Dedicated machine for indexing</i>: No, but nominal
+ usage at time of indexing.</li>
+ <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
+ <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
+ <li><i>Drive configuration</i>: RAID 5 on Fibre Channel
+ Array</li>
+ </p>
+ <p>
+ <b>Software environment</b><br/>
+ <li><i>Java Version</i>: 1.3.1_06</li>
+ <li><i>Java VM</i>: </li>
+ <li><i>OS Version</i>: Winnt 4/Sp6</li>
+ <li><i>Location of index</i>: local</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br/>
+ <li><i>Number of source documents</i>: about 60K</li>
+ <li><i>Total filesize of source documents</i>: 6.5GB</li>
+ <li><i>Average filesize of source documents</i>: 100K
+ (6.5GB/60K documents)</li>
+ <li><i>Source documents storage location</i>: filesystem on
+ NTFS</li>
+ <li><i>File type of source documents</i>: </li>
+ <li><i>Parser(s) used, if any</i>: Currently the only parser
+ used is the Quiotix HTML
+ parser.</li>
+ <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
+ <li><i>Number of fields per document</i>: 8</li>
+ <li><i>Type of fields</i>: All strings, and all are stored
+ and indexed.</li>
+ <li><i>Index persistence</i>: FSDirectory</li>
+ </p>
+ <p>
+ <b>Figures</b><br/>
+ <li><i>Time taken (in ms/s as an average of at least 3
+ indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
+ minutes. Note that the #
+ and size of documents changes daily.</li>
+ <li><i>Time taken / 1000 docs indexed</i>: </li>
+ <li><i>Memory consumption</i>: JVM is given 256MB and uses it
+ all.</li>
+ </p>
+ <p>
+ <b>Notes</b><br/>
+ <li><i>Notes</i>:
+ <p>
+ We have 10 threads reading files from the filesystem,
+ parsing and
+ analyzing them and then pushing them onto a queue, and a single
+ thread popping
+ them from the queue and indexing. Note that we are indexing
+ email messages
+ and are storing the entire plaintext of the message in the
+ index. If the
+ message contains an attachment and we do not have a filter for
+ the attachment
+ (i.e. we do not do PDFs yet), we discard the data.
+ </p></li>
+ </p>
+ </ul>
+ <p>
+ Justin can be contacted at tvxh-lw4x at spamex.com.
+ </p>
+ </subsection>
+
+
+ <subsection name="Daniel Armbrust's benchmarks">
+ <p>
+ My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed,
+ nor was the total index built in one shot. The index was created on several different
+ machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
+ 1 million documents per batch. Each of these small indexes was then moved to a
+ much larger drive, where they were all merged together into a big index.
+ This process was done manually, over the course of several months, as the sources became available.
+ </p>
+ <ul>
+ <p>
+ <b>Hardware Environment</b><br/>
+ <li><i>Dedicated machine for indexing</i>: no - The machine had moderate to low load. However, the indexing process was built single
+ threaded, so it only took advantage of 1 of the processors. It usually got 100% of this processor.</li>
+ <li><i>CPU</i>: Sun Ultra 80 4 x 64 bit processors</li>
+ <li><i>RAM</i>: 4 GB Memory</li>
+ <li><i>Drive configuration</i>: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
+ </p>
+ <p>
+ <b>Software environment</b><br/>
+ <li><i>Java Version</i>: 1.3.1</li>
+ <li><i>Java VM</i>: </li>
+ <li><i>OS Version</i>: Sun 5.8 (64 bit)</li>
+ <li><i>Location of index</i>: local</li>
+ </p>
+ <p>
+ <b>Lucene indexing variables</b><br/>
+ <li><i>Number of source documents</i>: 13,820,517</li>
+ <li><i>Total filesize of source documents</i>: 87.3 GB</li>
+ <li><i>Average filesize of source documents</i>: 6.3 KB</li>
+ <li><i>Source documents storage location</i>: Filesystem</li>
+ <li><i>File type of source documents</i>: XML</li>
+ <li><i>Parser(s) used, if any</i>: </li>
+ <li><i>Analyzer(s) used</i>: A home grown analyzer that simply removes stopwords.</li>
+ <li><i>Number of fields per document</i>: 1 - 31</li>
+ <li><i>Type of fields</i>: All text, though 2 of them are dates (20001205) that we filter on</li>
+ <li><i>Index persistence</i>: FSDirectory</li>
+ <li><i>Index size</i>: 12.5 GB</li>
+ </p>
+ <p>
+ <b>Figures</b><br/>
+ <li><i>Time taken (in ms/s as an average of at least 3
+ indexing runs)</i>: For 617271 documents, 209698 seconds (or ~2.5 days)</li>
+ <li><i>Time taken / 1000 docs indexed</i>: 340 Seconds</li>
+ <li><i>Memory consumption</i>: (java executed with) java -Xmx1000m -Xss8192k so
+ 1 GB of memory was allotted to the indexer</li>
+ </p>
+ <p>
+ <b>Notes</b><br/>
+ <li><i>Notes</i>:
+ <p>
+ The source documents were XML. The "indexer" opened each document one at a time, ran an
+ XSL transformation on them, and then proceeded to index the stream. The indexer optimized
+ the index every 50,000 documents (on this run) though previously, we optimized every
+ 300,000 documents. The performance didn't change much either way. We did no other
+ tuning (RAM Directories, separate process to pretransform the source material, etc)
+ to make it index faster. When all of these individual indexes were built, they were
+ merged together into the main index. That process usually took about a day.
+ </p></li>
+ </p>
+ </ul>
+ <p>
+ Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+ </p>
+ </subsection>
+
+ </section>

</body>
</document>
-
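
The derived figures in the submissions above (time per 1000 documents, average
document size) can be cross-checked from the stated totals. The following is
only a rough sketch and not part of the commit: the function names are made up
for illustration, and it assumes decimal units (1 GB = 10**6 KB), which the
submissions do not state.

```python
def seconds_per_thousand(total_seconds: float, doc_count: int) -> float:
    """Indexing time normalised to 1000 documents."""
    if doc_count <= 0:
        raise ValueError("doc_count must be positive")
    return total_seconds * 1000.0 / doc_count


def average_size_kb(total_gb: float, doc_count: int) -> float:
    """Average document size in KB, assuming 1 GB = 10**6 KB (decimal units)."""
    if doc_count <= 0:
        raise ValueError("doc_count must be positive")
    return total_gb * 1_000_000 / doc_count


if __name__ == "__main__":
    # Figures quoted in Daniel Armbrust's submission.
    print(round(seconds_per_thousand(209698, 617271)))   # 340
    print(round(average_size_kb(87.3, 13_820_517), 1))   # 6.3
    # Justin Greene's submission: 6.5GB over "about 60K" documents.
    print(round(average_size_kb(6.5, 60_000)))           # 108
```

Under these assumptions the 340 seconds per 1000 documents and the 6.3 KB
average agree with the reported values, while 6.5GB over 60K documents comes
out nearer 108K than the rounded 100K given in that submission.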



