Mailing List Archive: cvs commit: jakarta-lucene/xdocs benchmarks.xml

otis 2002/12/11 22:23:48

Modified: docs benchmarks.html contributions.html demo.html
demo2.html demo3.html demo4.html fileformats.html
gettingstarted.html index.html luceneplan.html
powered.html queryparsersyntax.html resources.html
todo.html whoweare.html
docs/lucene-sandbox index.html
docs/lucene-sandbox/indyo tutorial.html
docs/lucene-sandbox/larm overview.html
xdocs benchmarks.xml
Log:
- Modified docs.

Revision Changes Path
1.3 +324 -248 jakarta-lucene/docs/benchmarks.html

Index: benchmarks.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/benchmarks.html,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- benchmarks.html 4 Dec 2002 05:56:32 -0000 1.2
+++ benchmarks.html 12 Dec 2002 06:23:47 -0000 1.3
@@ -5,6 +5,7 @@



+ 


<html>
@@ -121,20 +122,20 @@
<tr><td>
<blockquote>

- The purpose of these user-submitted performance figures is to
-give current and potential users of Lucene a sense
- of how well Lucene scales. If the requirements for an upcoming
-project are similar to an existing benchmark, you
- will also have something to work with when designing the system
-architecture for the application.
- 
+ The purpose of these user-submitted performance figures is to
+ give current and potential users of Lucene a sense
+ of how well Lucene scales. If the requirements for an upcoming
+ project are similar to an existing benchmark, you
+ will also have something to work with when designing the system
+ architecture for the application.
+ 

- If you've conducted performance tests with Lucene, we'd
-appreciate if you can submit these figures for display
- on this page. Post these figures to the lucene-user mailing list
-using this
- <a href="benchmarktemplate.xml">template</a>.
- 
+ If you've conducted performance tests with Lucene, we'd
+ appreciate if you can submit these figures for display
+ on this page. Post these figures to the lucene-user mailing list
+ using this
+ <a href="benchmarktemplate.xml">template</a>.
+ 
</blockquote>

</td></tr>
@@ -149,64 +150,64 @@
<tr><td>
<blockquote>

- <ul>
- 
- Hardware Environment 
- <li>Dedicated machine for indexing: Self-explanatory
-(yes/no)</li>
- <li>CPU: Self-explanatory (Type, Speed and Quantity)</li>
- <li>RAM: Self-explanatory</li>
- <li>Drive configuration: Self-explanatory (IDE, SCSI,
-RAID-1, RAID-5)</li>
- 
- 
- Software environment 
- <li>Java Version: Version of Java SDK/JRE that is run
-</li>
- <li>Java VM: Server/client VM, Sun VM/JRockIt</li>
- <li>OS Version: Self-explanatory</li>
- <li>Location of index: Is the index stored in filesystem
-or database? Is it on the same server(local) or
- over the network?</li>
- 
- 
- Lucene indexing variables 
- <li>Number of source documents: Number of documents being
-indexed</li>
- <li>Total filesize of source documents:
-Self-explanatory</li>
- <li>Average filesize of source documents:
-Self-explanatory</li>
- <li>Source documents storage location: Where are the
-documents being indexed located?
- Filesystem, DB, http,etc</li>
- <li>File type of source documents: Types of files being
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
- <li>Parser(s) used, if any: Parsers used for parsing the
-various files for indexing,
- e.g. XML parser, HTML parser, etc.</li>
- <li>Analyzer(s) used: Type of Lucene analyzer used</li>
- <li>Number of fields per document: Number of Fields each
-Document contains</li>
- <li>Type of fields: Type of each field</li>
- <li>Index persistence: Where the index is stored, e.g.
-FSDirectory, SqlDirectory, etc</li>
- 
- 
- Figures 
- <li>Time taken (in ms/s as an average of at least 3 indexing
-runs): Time taken to index all files</li>
- <li>Time taken / 1000 docs indexed: Time taken to index
-1000 files</li>
- <li>Memory consumption: Self-explanatory</li>
- 
- 
- Notes 
- <li>Notes: Any comments which don't belong in the above,
-special tuning/strategies, etc</li>
- 
- </ul>
- 
+ <ul>
+ 
+ Hardware Environment 
+ <li>Dedicated machine for indexing: Self-explanatory
+ (yes/no)</li>
+ <li>CPU: Self-explanatory (Type, Speed and Quantity)</li>
+ <li>RAM: Self-explanatory</li>
+ <li>Drive configuration: Self-explanatory (IDE, SCSI,
+ RAID-1, RAID-5)</li>
+ 
+ 
+ Software environment 
+ <li>Java Version: Version of Java SDK/JRE that is run
+ </li>
+ <li>Java VM: Server/client VM, Sun VM/JRockIt</li>
+ <li>OS Version: Self-explanatory</li>
+ <li>Location of index: Is the index stored in filesystem
+ or database? Is it on the same server(local) or
+ over the network?</li>
+ 
+ 
+ Lucene indexing variables 
+ <li>Number of source documents: Number of documents being
+ indexed</li>
+ <li>Total filesize of source documents:
+ Self-explanatory</li>
+ <li>Average filesize of source documents:
+ Self-explanatory</li>
+ <li>Source documents storage location: Where are the
+ documents being indexed located?
+ Filesystem, DB, http,etc</li>
+ <li>File type of source documents: Types of files being
+ indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+ <li>Parser(s) used, if any: Parsers used for parsing the
+ various files for indexing,
+ e.g. XML parser, HTML parser, etc.</li>
+ <li>Analyzer(s) used: Type of Lucene analyzer used</li>
+ <li>Number of fields per document: Number of Fields each
+ Document contains</li>
+ <li>Type of fields: Type of each field</li>
+ <li>Index persistence: Where the index is stored, e.g.
+ FSDirectory, SqlDirectory, etc</li>
+ 
+ 
+ Figures 
+ <li>Time taken (in ms/s as an average of at least 3 indexing
+ runs): Time taken to index all files</li>
+ <li>Time taken / 1000 docs indexed: Time taken to index
+ 1000 files</li>
+ <li>Memory consumption: Self-explanatory</li>
+ 
+ 
+ Notes 
+ <li>Notes: Any comments which don't belong in the above,
+ special tuning/strategies, etc</li>
+ 
+ </ul>
+ 
</blockquote>

</td></tr>
@@ -221,17 +222,17 @@
<tr><td>
<blockquote>

- These benchmarks have been kindly submitted by Lucene users for
-reference purposes.
- 
- We make NO guarantees regarding their accuracy or
-validity.
- 
- We strongly recommend you conduct your own
- performance benchmarks before deciding on a particular
-hardware/software setup (and hopefully submit
- these figures to us).
- 
+ These benchmarks have been kindly submitted by Lucene users for
+ reference purposes.
+ 
+ We make NO guarantees regarding their accuracy or
+ validity.
+ 
+ We strongly recommend you conduct your own
+ performance benchmarks before deciding on a particular
+ hardware/software setup (and hopefully submit
+ these figures to us).
+ 
<table border="0" cellspacing="0" cellpadding="2" width="100%">
<tr><td bgcolor="#828DA6">

@@ -241,109 +242,109 @@
<tr><td>
<blockquote>
<ul>
- 
- Hardware Environment 
- <li>Dedicated machine for indexing: yes</li>
- <li>CPU: Intel x86 P4 1.5Ghz</li>
- <li>RAM: 512 DDR</li>
- <li>Drive configuration: IDE 7200rpm Raid-1</li>
- 
- 
- Software environment 
- <li>Java Version: 1.3.1 IBM JITC Enabled</li>
- <li>Java VM: </li>
- <li>OS Version: Debian Linux 2.4.18-686</li>
- <li>Location of index: local</li>
- 
- 
- Lucene indexing variables 
- <li>Number of source documents: Random generator. Set
-to make 1M documents
-in 2x500,000 batches.</li>
- <li>Total filesize of source documents: > 1GB if
-stored</li>
- <li>Average filesize of source documents: 1KB</li>
- <li>Source documents storage location: Filesystem</li>
- <li>File type of source documents: Generated</li>
- <li>Parser(s) used, if any: </li>
- <li>Analyzer(s) used: Default</li>
- <li>Number of fields per document: 11</li>
- <li>Type of fields: 1 date, 1 id, 9 text</li>
- <li>Index persistence: FSDirectory</li>
- 
- 
- Figures 
- <li>Time taken (in ms/s as an average of at least 3
-indexing runs): </li>
- <li>Time taken / 1000 docs indexed: 49 seconds</li>
- <li>Memory consumption:</li>
- 
- 
- Notes 
- <li>Notes:
- 
- A windows client ran a random document generator which
-created
- documents based on some arrays of values and an excerpt
-(approx 1kb)
- from a text file of the bible (King James version). 
- These were submitted via a socket connection (open throughout
- indexing process). 
- The index writer was not closed between index calls. 
- This created a 400Mb index in 23 files (after
-optimization). 
- 
- 
- Query details: 
- 
- 
- Set up a threaded class to start x number of simultaneous
-threads to
- search the above created index.
- 
- 
- Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
-(Teaser:goo* Tea
- ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
- +DisplayStartDate:[mkwsw2jk0
- -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
- 
- 
- This query counted 34000 documents and I limited the returned
-documents
- to 5.
- 
- 
- This is using Peter Halacsy's IndexSearcherCache slightly
-modified to
- be a singleton returning cached searchers for a given
-directory. This
- solved an initial problem with too many files open and
-running out of
- linux handles for them.
- 
- <pre>
- Threads|Avg Time per query (ms)
- 1 1009ms
- 2 2043ms
- 3 3087ms
- 4 4045ms
- .. .
- .. .
- 10 10091ms
- </pre>
- 
- I removed the two date range terms from the query and it made
-a HUGE
- difference in performance. With 4 threads the avg time
-dropped to 900ms!
- 
- Other query optimizations made little difference.</li>
- 
- </ul>
+ 
+ Hardware Environment 
+ <li>Dedicated machine for indexing: yes</li>
+ <li>CPU: Intel x86 P4 1.5Ghz</li>
+ <li>RAM: 512 DDR</li>
+ <li>Drive configuration: IDE 7200rpm Raid-1</li>
+ 
+ 
+ Software environment 
+ <li>Java Version: 1.3.1 IBM JITC Enabled</li>
+ <li>Java VM: </li>
+ <li>OS Version: Debian Linux 2.4.18-686</li>
+ <li>Location of index: local</li>
+ 
+ 
+ Lucene indexing variables 
+ <li>Number of source documents: Random generator. Set
+ to make 1M documents
+ in 2x500,000 batches.</li>
+ <li>Total filesize of source documents: > 1GB if
+ stored</li>
+ <li>Average filesize of source documents: 1KB</li>
+ <li>Source documents storage location: Filesystem</li>
+ <li>File type of source documents: Generated</li>
+ <li>Parser(s) used, if any: </li>
+ <li>Analyzer(s) used: Default</li>
+ <li>Number of fields per document: 11</li>
+ <li>Type of fields: 1 date, 1 id, 9 text</li>
+ <li>Index persistence: FSDirectory</li>
+ 
+ 
+ Figures 
+ <li>Time taken (in ms/s as an average of at least 3
+ indexing runs): </li>
+ <li>Time taken / 1000 docs indexed: 49 seconds</li>
+ <li>Memory consumption:</li>
+ 
+ 
+ Notes 
+ <li>Notes:
+ 
+ A windows client ran a random document generator which
+ created
+ documents based on some arrays of values and an excerpt
+ (approx 1kb)
+ from a text file of the bible (King James version). 
+ These were submitted via a socket connection (open throughout
+ indexing process). 
+ The index writer was not closed between index calls. 
+ This created a 400Mb index in 23 files (after
+ optimization). 
+ 
+ 
+ Query details: 
+ 
+ 
+ Set up a threaded class to start x number of simultaneous
+ threads to
+ search the above created index.
+ 
+ 
+ Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
+ (Teaser:goo* Tea
+ ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
+ +DisplayStartDate:[mkwsw2jk0
+ -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
+ 
+ 
+ This query counted 34000 documents and I limited the returned
+ documents
+ to 5.
+ 
+ 
+ This is using Peter Halacsy's IndexSearcherCache slightly
+ modified to
+ be a singleton returning cached searchers for a given
+ directory. This
+ solved an initial problem with too many files open and
+ running out of
+ linux handles for them.
+ 
+ <pre>
+ Threads|Avg Time per query (ms)
+ 1 1009ms
+ 2 2043ms
+ 3 3087ms
+ 4 4045ms
+ .. .
+ .. .
+ 10 10091ms
+ </pre>
+ 
+ I removed the two date range terms from the query and it made
+ a HUGE
+ difference in performance. With 4 threads the avg time
+ dropped to 900ms!
+ 
+ Other query optimizations made little difference.</li>
+ 
+ </ul>

- Hamish can be contacted at hamish at catalyst.net.nz.
- 
+ Hamish can be contacted at hamish at catalyst.net.nz.
+ 
</blockquote>
</td></tr>
<tr><td> </td></tr>
@@ -357,71 +358,146 @@
<tr><td>
<blockquote>
<ul>
- 
- Hardware Environment 
- <li>Dedicated machine for indexing: No, but nominal
-usage at time of indexing.</li>
- <li>CPU: Compaq Proliant 1850R/600 2 X pIII 600</li>
- <li>RAM: 1GB, 256MB allocated to JVM.</li>
- <li>Drive configuration: RAID 5 on Fibre Channel
-Array</li>
- 
- 
- Software environment 
- <li>Java Version: 1.3.1_06</li>
- <li>Java VM: </li>
- <li>OS Version: Winnt 4/Sp6</li>
- <li>Location of index: local</li>
- 
- 
- Lucene indexing variables 
- <li>Number of source documents: about 60K</li>
- <li>Total filesize of source documents: 6.5GB</li>
- <li>Average filesize of source documents: 100K
-(6.5GB/60K documents)</li>
- <li>Source documents storage location: filesystem on
-NTFS</li>
- <li>File type of source documents: </li>
- <li>Parser(s) used, if any: Currently the only parser
-used is the Quiotix html
- parser.</li>
- <li>Analyzer(s) used: SimpleAnalyzer</li>
- <li>Number of fields per document: 8</li>
- <li>Type of fields: All strings, and all are stored
-and indexed.</li>
- <li>Index persistence: FSDirectory</li>
- 
- 
- Figures 
- <li>Time taken (in ms/s as an average of at least 3
-indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
-minutes. Note that the number
- and size of documents change daily.</li>
- <li>Time taken / 1000 docs indexed: </li>
- <li>Memory consumption: JVM is given 256MB and uses it
-all.</li>
- 
- 
- Notes 
- <li>Notes:
- 
- We have 10 threads reading files from the filesystem and
-parsing and
- analyzing them and then pushing them onto a queue and a single
-thread popping
- them from the queue and indexing. Note that we are indexing
-email messages
- and are storing the entire plaintext of the message in the
-index. If the
- message contains an attachment and we do not have a filter for
-the attachment
- (i.e., we do not do PDFs yet), we discard the data.
- </li>
- 
- </ul>
+ 
+ Hardware Environment 
+ <li>Dedicated machine for indexing: No, but nominal
+ usage at time of indexing.</li>
+ <li>CPU: Compaq Proliant 1850R/600 2 X pIII 600</li>
+ <li>RAM: 1GB, 256MB allocated to JVM.</li>
+ <li>Drive configuration: RAID 5 on Fibre Channel
+ Array</li>
+ 
+ 
+ Software environment 
+ <li>Java Version: 1.3.1_06</li>
+ <li>Java VM: </li>
+ <li>OS Version: Winnt 4/Sp6</li>
+ <li>Location of index: local</li>
+ 
+ 
+ Lucene indexing variables 
+ <li>Number of source documents: about 60K</li>
+ <li>Total filesize of source documents: 6.5GB</li>
+ <li>Average filesize of source documents: 100K
+ (6.5GB/60K documents)</li>
+ <li>Source documents storage location: filesystem on
+ NTFS</li>
+ <li>File type of source documents: </li>
+ <li>Parser(s) used, if any: Currently the only parser
+ used is the Quiotix html
+ parser.</li>
+ <li>Analyzer(s) used: SimpleAnalyzer</li>
+ <li>Number of fields per document: 8</li>
+ <li>Type of fields: All strings, and all are stored
+ and indexed.</li>
+ <li>Index persistence: FSDirectory</li>
+ 
+ 
+ Figures 
+ <li>Time taken (in ms/s as an average of at least 3
+ indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
+ minutes. Note that the number
+ and size of documents change daily.</li>
+ <li>Time taken / 1000 docs indexed: </li>
+ <li>Memory consumption: JVM is given 256MB and uses it
+ all.</li>
+ 
+ 
+ Notes 
+ <li>Notes:
+ 
+ We have 10 threads reading files from the filesystem and
+ parsing and
+ analyzing them and then pushing them onto a queue and a single
+ thread popping
+ them from the queue and indexing. Note that we are indexing
+ email messages
+ and are storing the entire plaintext of the message in the
+ index. If the
+ message contains an attachment and we do not have a filter for
+ the attachment
+ (i.e., we do not do PDFs yet), we discard the data.
+ </li>
+ 
+ </ul>
+ 
+ Justin can be contacted at tvxh-lw4x at spamex.com.
+ 
+ </blockquote>
+ </td></tr>
+ <tr><td> </td></tr>
+ </table>
+ <table border="0" cellspacing="0" cellpadding="2" width="100%">
+ <tr><td bgcolor="#828DA6">
+ 
+ <a name="Daniel Armbrust's benchmarks">Daniel Armbrust's benchmarks</a>
+ 
+ </td></tr>
+ <tr><td>
+ <blockquote>
+ 
+ My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed,
+ nor was the total index built in one shot. The index was created on several different
+ machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
+ 1 million documents per batch. Each of these small indexes was then moved to a
+ much larger drive, where they were all merged together into a big index.
+ This process was done manually, over the course of several months, as the sources became available.
+ 
+ <ul>
+ 
+ Hardware Environment 
+ <li>Dedicated machine for indexing: no - The machine had moderate to low load. However, the indexing process was built single
+ threaded, so it only took advantage of 1 of the processors. It usually got 100% of this processor.</li>
+ <li>CPU: Sun Ultra 80 4 x 64 bit processors</li>
+ <li>RAM: 4 GB Memory</li>
+ <li>Drive configuration: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
+ 
+ 
+ Software environment 
+ <li>Java Version: 1.3.1</li>
+ <li>Java VM: </li>
+ <li>OS Version: Sun 5.8 (64 bit)</li>
+ <li>Location of index: local</li>
+ 
+ 
+ Lucene indexing variables 
+ <li>Number of source documents: 13,820,517</li>
+ <li>Total filesize of source documents: 87.3 GB</li>
+ <li>Average filesize of source documents: 6.3 KB</li>
+ <li>Source documents storage location: Filesystem</li>
+ <li>File type of source documents: XML</li>
+ <li>Parser(s) used, if any: </li>
+ <li>Analyzer(s) used: A home grown analyzer that simply removes stopwords.</li>
+ <li>Number of fields per document: 1 - 31</li>
+ <li>Type of fields: All text, though 2 of them are dates (20001205) that we filter on</li>
+ <li>Index persistence: FSDirectory</li>
+ <li>Index size: 12.5 GB</li>
+ 
+ 
+ Figures 
+ <li>Time taken (in ms/s as an average of at least 3
+ indexing runs): For 617271 documents, 209698 seconds (or ~2.5 days)</li>
+ <li>Time taken / 1000 docs indexed: 340 Seconds</li>
+ <li>Memory consumption: (java executed with) java -Xmx1000m -Xss8192k so
+ 1 GB of memory was allotted to the indexer</li>
+ 
+ 
+ Notes 
+ <li>Notes:
+ 
+ The source documents were XML. The "indexer" opened each document one at a time, ran an
+ XSL transformation on it, and then proceeded to index the stream. The indexer optimized
+ the index every 50,000 documents (on this run), though previously we optimized every
+ 300,000 documents. The performance didn't change much either way. We did no other
+ tuning (RAM Directories, separate process to pretransform the source material, etc.)
+ to make it index faster. When all of these individual indexes were built, they were
+ merged together into the main index. That process usually took about a day.
+ </li>
+ 
+ </ul>

- Justin can be contacted at tvxh-lw4x at spamex.com.
- 
+ Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+ 
</blockquote>
</td></tr>
<tr><td> </td></tr>
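
Justin's notes in the benchmarks.html diff above describe 10 threads that read, parse and analyze files, push the results onto a queue, and leave the actual indexing to a single thread. A minimal sketch of that arrangement follows. It is not the submitter's code: it uses modern java.util.concurrent classes (which did not exist for the 1.3.1 JVM in the benchmark), and the class name, the "parsed:" prefix, the queue capacity of 100 and the DONE sentinel are all assumptions made for illustration. The "indexing" step is a stand-in that only collects strings; a real indexer would hand each document to a Lucene IndexWriter there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueIndexingSketch {
    // Hypothetical sentinel: each producer enqueues it once when it has no more files.
    // new String(...) guarantees a unique object, so identity comparison is safe.
    private static final String DONE = new String("DONE");

    /** Runs producerCount "parser" threads feeding one "indexer" thread. */
    static List<String> run(List<String> files, int producerCount) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100); // capacity is arbitrary
        List<String> indexed = new ArrayList<>(); // written only by the consumer thread

        // Single consumer: pops documents and "indexes" them until every producer is done.
        Thread consumer = new Thread(() -> {
            int finished = 0;
            try {
                while (finished < producerCount) {
                    String doc = queue.take();
                    if (doc == DONE) {
                        finished++;
                    } else {
                        indexed.add(doc); // stand-in for adding the document to the index
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producers: each takes every producerCount-th file, "parses" it, and enqueues it.
        List<Thread> producers = new ArrayList<>();
        for (int p = 0; p < producerCount; p++) {
            final int offset = p;
            Thread t = new Thread(() -> {
                try {
                    for (int i = offset; i < files.size(); i += producerCount) {
                        queue.put("parsed:" + files.get(i)); // stand-in for parse + analyze
                    }
                    queue.put(DONE);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producers.add(t);
            t.start();
        }

        for (Thread t : producers) {
            t.join();
        }
        consumer.join(); // join makes the consumer's writes to 'indexed' visible here
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            files.add("message-" + i + ".eml");
        }
        System.out.println(run(files, 10).size() + " documents indexed");
    }
}
```

The bounded queue means the parser threads block when the single indexing thread falls behind, which keeps memory use predictable; whether the original setup used a bounded queue is not stated in the submission.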

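The "Time taken / 1000 docs indexed" figure in Daniel Armbrust's submission above can be checked from the two numbers he quotes for the same run (617271 documents, 209698 seconds). The helper below is only a sketch of that arithmetic; the class and method names are made up for illustration and are not part of Lucene or of the benchmark template.

```java
public class BenchmarkRate {
    /** Seconds needed per 1000 documents, given a total time and a document count. */
    static double secondsPerThousandDocs(double totalSeconds, long documents) {
        return totalSeconds / documents * 1000.0;
    }

    /** Converts a duration in seconds to days. */
    static double days(double totalSeconds) {
        return totalSeconds / 86400.0;
    }

    public static void main(String[] args) {
        // Figures quoted in Daniel Armbrust's submission.
        double rate = secondsPerThousandDocs(209698, 617271);
        System.out.println("seconds per 1000 docs: " + rate);
        System.out.println("total days: " + days(209698));
    }
}
```

The rate comes out at roughly 339.7 seconds per 1000 documents, consistent with the 340 seconds reported, and the total is a little over 2.4 days, which the submission rounds to ~2.5.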
1.17 +1 -0 jakarta-lucene/docs/contributions.html

Index: contributions.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/contributions.html,v
retrieving revision 1.16
retrieving revision 1.17
diff -u -r1.16 -r1.17
--- contributions.html 4 Dec 2002 05:56:32 -0000 1.16
+++ contributions.html 12 Dec 2002 06:23:47 -0000 1.17
@@ -5,6 +5,7 @@



+ 


<html>

1.13 +1 -0 jakarta-lucene/docs/demo.html

Index: demo.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- demo.html 4 Dec 2002 05:56:32 -0000 1.12
+++ demo.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@



+ 


<html>

1.13 +1 -0 jakarta-lucene/docs/demo2.html

Index: demo2.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo2.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- demo2.html 4 Dec 2002 05:56:32 -0000 1.12
+++ demo2.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@



+ 


<html>

1.15 +1 -0 jakarta-lucene/docs/demo3.html

Index: demo3.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo3.html,v
retrieving revision 1.14
retrieving revision 1.15
diff -u -r1.14 -r1.15
--- demo3.html 4 Dec 2002 05:56:32 -0000 1.14
+++ demo3.html 12 Dec 2002 06:23:47 -0000 1.15
@@ -5,6 +5,7 @@



+ 


<html>

1.13 +1 -0 jakarta-lucene/docs/demo4.html

Index: demo4.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/demo4.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- demo4.html 4 Dec 2002 05:56:32 -0000 1.12
+++ demo4.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@



+ 


<html>

1.6 +1 -0 jakarta-lucene/docs/fileformats.html

Index: fileformats.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/fileformats.html,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -r1.5 -r1.6
--- fileformats.html 4 Dec 2002 05:56:32 -0000 1.5
+++ fileformats.html 12 Dec 2002 06:23:47 -0000 1.6
@@ -5,6 +5,7 @@



+ 


<html>

1.13 +1 -0 jakarta-lucene/docs/gettingstarted.html

Index: gettingstarted.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/gettingstarted.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -u -r1.12 -r1.13
--- gettingstarted.html 4 Dec 2002 05:56:32 -0000 1.12
+++ gettingstarted.html 12 Dec 2002 06:23:47 -0000 1.13
@@ -5,6 +5,7 @@



+ 


<html>

1.24 +1 -0 jakarta-lucene/docs/index.html

Index: index.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/index.html,v
retrieving revision 1.23
retrieving revision 1.24
diff -u -r1.23 -r1.24
--- index.html 4 Dec 2002 05:56:32 -0000 1.23
+++ index.html 12 Dec 2002 06:23:47 -0000 1.24
@@ -5,6 +5,7 @@



+ 


<html>

1.14 +1 -0 jakarta-lucene/docs/luceneplan.html

Index: luceneplan.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/luceneplan.html,v
retrieving revision 1.13
retrieving revision 1.14
diff -u -r1.13 -r1.14
--- luceneplan.html 4 Dec 2002 05:56:32 -0000 1.13
+++ luceneplan.html 12 Dec 2002 06:23:47 -0000 1.14
@@ -5,6 +5,7 @@



+ 


<html>

1.22 +1 -0 jakarta-lucene/docs/powered.html

Index: powered.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/powered.html,v
retrieving revision 1.21
retrieving revision 1.22
diff -u -r1.21 -r1.22
--- powered.html 4 Dec 2002 05:56:32 -0000 1.21
+++ powered.html 12 Dec 2002 06:23:47 -0000 1.22
@@ -5,6 +5,7 @@



+ 


<html>

1.12 +1 -0 jakarta-lucene/docs/queryparsersyntax.html

Index: queryparsersyntax.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/queryparsersyntax.html,v
retrieving revision 1.11
retrieving revision 1.12
diff -u -r1.11 -r1.12
--- queryparsersyntax.html 4 Dec 2002 05:56:32 -0000 1.11
+++ queryparsersyntax.html 12 Dec 2002 06:23:47 -0000 1.12
@@ -5,6 +5,7 @@



+ 


<html>

1.20 +1 -0 jakarta-lucene/docs/resources.html

Index: resources.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/resources.html,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -r1.19 -r1.20
--- resources.html 4 Dec 2002 05:56:32 -0000 1.19
+++ resources.html 12 Dec 2002 06:23:47 -0000 1.20
@@ -5,6 +5,7 @@



+ 


<html>

1.4 +1 -0 jakarta-lucene/docs/todo.html

Index: todo.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/todo.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- todo.html 4 Dec 2002 05:56:32 -0000 1.3
+++ todo.html 12 Dec 2002 06:23:47 -0000 1.4
@@ -5,6 +5,7 @@



+ 


<html>

1.20 +1 -0 jakarta-lucene/docs/whoweare.html

Index: whoweare.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/whoweare.html,v
retrieving revision 1.19
retrieving revision 1.20
diff -u -r1.19 -r1.20
--- whoweare.html 4 Dec 2002 05:56:32 -0000 1.19
+++ whoweare.html 12 Dec 2002 06:23:47 -0000 1.20
@@ -5,6 +5,7 @@



+ 


<html>

1.8 +1 -0 jakarta-lucene/docs/lucene-sandbox/index.html

Index: index.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/index.html,v
retrieving revision 1.7
retrieving revision 1.8
diff -u -r1.7 -r1.8
--- index.html 4 Dec 2002 05:56:33 -0000 1.7
+++ index.html 12 Dec 2002 06:23:48 -0000 1.8
@@ -5,6 +5,7 @@



+ 


<html>

1.7 +1 -0 jakarta-lucene/docs/lucene-sandbox/indyo/tutorial.html

Index: tutorial.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/indyo/tutorial.html,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -r1.6 -r1.7
--- tutorial.html 4 Dec 2002 05:56:33 -0000 1.6
+++ tutorial.html 12 Dec 2002 06:23:48 -0000 1.7
@@ -5,6 +5,7 @@



+ 


<html>

1.6 +1 -0 jakarta-lucene/docs/lucene-sandbox/larm/overview.html

Index: overview.html
===================================================================
RCS file: /home/cvs/jakarta-lucene/docs/lucene-sandbox/larm/overview.html,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -r1.5 -r1.6
--- overview.html 4 Dec 2002 05:56:33 -0000 1.5
+++ overview.html 12 Dec 2002 06:23:48 -0000 1.6
@@ -5,6 +5,7 @@



+ 


<html>

1.2 +337 -271 jakarta-lucene/xdocs/benchmarks.xml

Index: benchmarks.xml
===================================================================
RCS file: /home/cvs/jakarta-lucene/xdocs/benchmarks.xml,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- benchmarks.xml 4 Dec 2002 05:46:43 -0000 1.1
+++ benchmarks.xml 12 Dec 2002 06:23:48 -0000 1.2
@@ -1,283 +1,349 @@
<?xml version="1.0"?>
<document>
<properties>
- <author email="kelvint@apache.org">Kelvin Tan</author>
- <title>Resources - Performance Benchmarks</title>
+ <author email="kelvint@apache.org">Kelvin Tan</author>
+ <title>Resources - Performance Benchmarks</title>
</properties>
<body>

- <section name="Performance Benchmarks">
- 
- The purpose of these user-submitted performance figures is to
-give current and potential users of Lucene a sense
- of how well Lucene scales. If the requirements for an upcoming
-project are similar to an existing benchmark, you
- will also have something to work with when designing the system
-architecture for the application.
- 
- 
- If you've conducted performance tests with Lucene, we'd
-appreciate if you can submit these figures for display
- on this page. Post these figures to the lucene-user mailing list
-using this
- <a href="benchmarktemplate.xml">template</a>.
- 
- </section>
-
- <section name="Benchmark Variables">
- 
- <ul>
- 
- Hardware Environment 
- <li>Dedicated machine for indexing: Self-explanatory
-(yes/no)</li>
- <li>CPU: Self-explanatory (Type, Speed and Quantity)</li>
- <li>RAM: Self-explanatory</li>
- <li>Drive configuration: Self-explanatory (IDE, SCSI,
-RAID-1, RAID-5)</li>
- 
- 
- Software environment 
- <li>Java Version: Version of Java SDK/JRE that is run
-</li>
- <li>Java VM: Server/client VM, Sun VM/JRockIt</li>
- <li>OS Version: Self-explanatory</li>
- <li>Location of index: Is the index stored in filesystem
-or database? Is it on the same server(local) or
- over the network?</li>
- 
- 
- Lucene indexing variables 
- <li>Number of source documents: Number of documents being
-indexed</li>
- <li>Total filesize of source documents:
-Self-explanatory</li>
- <li>Average filesize of source documents:
-Self-explanatory</li>
- <li>Source documents storage location: Where are the
-documents being indexed located?
- Filesystem, DB, http,etc</li>
- <li>File type of source documents: Types of files being
-indexed, e.g. HTML files, XML files, PDF files, etc.</li>
- <li>Parser(s) used, if any: Parsers used for parsing the
-various files for indexing,
- e.g. XML parser, HTML parser, etc.</li>
- <li>Analyzer(s) used: Type of Lucene analyzer used</li>
- <li>Number of fields per document: Number of Fields each
-Document contains</li>
- <li>Type of fields: Type of each field</li>
- <li>Index persistence: Where the index is stored, e.g.
-FSDirectory, SqlDirectory, etc</li>
- 
- 
- Figures 
- <li>Time taken (in ms/s as an average of at least 3 indexing
-runs): Time taken to index all files</li>
- <li>Time taken / 1000 docs indexed: Time taken to index
-1000 files</li>
- <li>Memory consumption: Self-explanatory</li>
- 
- 
- Notes 
- <li>Notes: Any comments which don't belong in the above,
-special tuning/strategies, etc</li>
- 
- </ul>
- 
- </section>
+ <section name="Performance Benchmarks">
+ 
+ The purpose of these user-submitted performance figures is to
+ give current and potential users of Lucene a sense
+ of how well Lucene scales. If the requirements for an upcoming
+ project are similar to an existing benchmark, you
+ will also have something to work with when designing the system
+ architecture for the application.
+ 
+ 
+ If you've conducted performance tests with Lucene, we'd
+ appreciate if you can submit these figures for display
+ on this page. Post these figures to the lucene-user mailing list
+ using this
+ <a href="benchmarktemplate.xml">template</a>.
+ 
+ </section>

- <section name="User-submitted Benchmarks">
- 
- These benchmarks have been kindly submitted by Lucene users for
-reference purposes.
- 
- We make NO guarantees regarding their accuracy or
-validity.
- 
- We strongly recommend you conduct your own
- performance benchmarks before deciding on a particular
-hardware/software setup (and hopefully submit
- these figures to us).
- 
-
- <subsection name="Hamish Carpenter's benchmarks">
- <ul>
- 
- Hardware Environment 
- <li>Dedicated machine for indexing: yes</li>
- <li>CPU: Intel x86 P4 1.5Ghz</li>
- <li>RAM: 512 DDR</li>
- <li>Drive configuration: IDE 7200rpm Raid-1</li>
- 
- 
- Software environment 
- <li>Java Version: 1.3.1 IBM JITC Enabled</li>
- <li>Java VM: </li>
- <li>OS Version: Debian Linux 2.4.18-686</li>
- <li>Location of index: local</li>
- 
- 
- Lucene indexing variables 
- <li>Number of source documents: Random generator. Set
-to make 1M documents
-in 2x500,000 batches.</li>
- <li>Total filesize of source documents: > 1GB if
-stored</li>
- <li>Average filesize of source documents: 1KB</li>
- <li>Source documents storage location: Filesystem</li>
- <li>File type of source documents: Generated</li>
- <li>Parser(s) used, if any: </li>
- <li>Analyzer(s) used: Default</li>
- <li>Number of fields per document: 11</li>
- <li>Type of fields: 1 date, 1 id, 9 text</li>
- <li>Index persistence: FSDirectory</li>
- 
- 
- Figures 
- <li>Time taken (in ms/s as an average of at least 3
-indexing runs): </li>
- <li>Time taken / 1000 docs indexed: 49 seconds</li>
- <li>Memory consumption:</li>
- 
- 
- Notes 
- <li>Notes:
- 
- A windows client ran a random document generator which
-created
- documents based on some arrays of values and an excerpt
-(approx 1kb)
- from a text file of the bible (King James version). 
- These were submitted via a socket connection (open throughout
- indexing process). 
- The index writer was not closed between index calls. 
- This created a 400Mb index in 23 files (after
-optimization). 
- 
- 
- Query details: 
- 
- 
- Set up a threaded class to start x number of simultaneous
-threads to
- search the above created index.
- 
- 
- Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
-(Teaser:goo* Tea
- ser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
- +DisplayStartDate:[mkwsw2jk0
- -mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
- 
- 
- This query counted 34000 documents and I limited the returned
-documents
- to 5.
- 
- 
- This is using Peter Halacsy's IndexSearcherCache slightly
-modified to
- be a singleton returned cached searchers for a given
-directory. This
- solved an initial problem with too many files open and
-running out of
- linux handles for them.
- 
- <pre>
- Threads|Avg Time per query (ms)
- 1 1009ms
- 2 2043ms
- 3 3087ms
- 4 4045ms
- .. .
- .. .
- 10 10091ms
- </pre>
- 
- I removed the two date range terms from the query and it made
-a HUGE
- difference in performance. With 4 threads the avg time
-dropped to 900ms!
- 
- Other query optimizations made little difference.</li>
- 
- </ul>
- 
- Hamish can be contacted at hamish at catalyst.net.nz.
- 
- </subsection>
+ <section name="Benchmark Variables">
+ 
+ <ul>
+ 
+ Hardware Environment 
+ <li>Dedicated machine for indexing: Self-explanatory
+ (yes/no)</li>
+ <li>CPU: Self-explanatory (Type, Speed and Quantity)</li>
+ <li>RAM: Self-explanatory</li>
+ <li>Drive configuration: Self-explanatory (IDE, SCSI,
+ RAID-1, RAID-5)</li>
+ 
+ 
+ Software environment 
+ <li>Java Version: Version of Java SDK/JRE that is run
+ </li>
+ <li>Java VM: Server/client VM, Sun VM/JRockIt</li>
+ <li>OS Version: Self-explanatory</li>
+ <li>Location of index: Is the index stored in filesystem
+ or database? Is it on the same server(local) or
+ over the network?</li>
+ 
+ 
+ Lucene indexing variables 
+ <li>Number of source documents: Number of documents being
+ indexed</li>
+ <li>Total filesize of source documents:
+ Self-explanatory</li>
+ <li>Average filesize of source documents:
+ Self-explanatory</li>
+ <li>Source documents storage location: Where are the
+ documents being indexed located?
+ Filesystem, DB, HTTP, etc.</li>
+ <li>File type of source documents: Types of files being
+ indexed, e.g. HTML files, XML files, PDF files, etc.</li>
+ <li>Parser(s) used, if any: Parsers used for parsing the
+ various files for indexing,
+ e.g. XML parser, HTML parser, etc.</li>
+ <li>Analyzer(s) used: Type of Lucene analyzer used</li>
+ <li>Number of fields per document: Number of Fields each
+ Document contains</li>
+ <li>Type of fields: Type of each field</li>
+ <li>Index persistence: Where the index is stored, e.g.
+ FSDirectory, SqlDirectory, etc</li>
+ 
+ 
+ Figures 
+ <li>Time taken (in ms/s as an average of at least 3 indexing
+ runs): Time taken to index all files</li>
+ <li>Time taken / 1000 docs indexed: Time taken to index
+ 1000 files</li>
+ <li>Memory consumption: Self-explanatory</li>
+ 
+ 
+ Notes 
+ <li>Notes: Any comments which don't belong in the above,
+ special tuning/strategies, etc</li>
+ 
+ </ul>
+ 
+ </section>

- <subsection name="Justin Greene's benchmarks">
- <ul>
- 
- Hardware Environment 
- <li>Dedicated machine for indexing: No, but nominal
-usage at time of indexing.</li>
- <li>CPU: Compaq Proliant 1850R/600 2 X pIII 600</li>
- <li>RAM: 1GB, 256MB allocated to JVM.</li>
- <li>Drive configuration: RAID 5 on Fibre Channel
-Array</li>
- 
- 
- Software environment 
- <li>Java Version: 1.3.1_06</li>
- <li>Java VM: </li>
- <li>OS Version: Winnt 4/Sp6</li>
- <li>Location of index: local</li>
- 
- 
- Lucene indexing variables 
- <li>Number of source documents: about 60K</li>
- <li>Total filesize of source documents: 6.5GB</li>
- <li>Average filesize of source documents: 100K
-(6.5GB/60K documents)</li>
- <li>Source documents storage location: filesystem on
-NTFS</li>
- <li>File type of source documents: </li>
- <li>Parser(s) used, if any: Currently the only parser
-used is the Quiotix html
- parser.</li>
- <li>Analyzer(s) used: SimpleAnalyzer</li>
- <li>Number of fields per document: 8</li>
- <li>Type of fields: All strings, and all are stored
-and indexed.</li>
- <li>Index persistence: FSDirectory</li>
- 
- 
- Figures 
- <li>Time taken (in ms/s as an average of at least 3
-indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
-minutes. Note that the #
- and size of documents changes daily.</li>
- <li>Time taken / 1000 docs indexed: </li>
- <li>Memory consumption: JVM is given 256MB and uses it
-all.</li>
- 
- 
- Notes 
- <li>Notes:
- 
- We have 10 threads reading files from the filesystem and
-parsing and
- analyzing them and the pushing them onto a queue and a single
-thread poping
- them from the queue and indexing. Note that we are indexing
-email messages
- and are storing the entire plaintext in of the message in the
-index. If the
- message contains attachment and we do not have a filter for
-the attachment
- (ie. we do not do PDFs yet), we discard the data.
- </li>
- 
- </ul>
- 
- Justin can be contacted at tvxh-lw4x at spamex.com.
- 
- </subsection>
+ <section name="User-submitted Benchmarks">
+ 
+ These benchmarks have been kindly submitted by Lucene users for
+ reference purposes.
+ 
+ We make NO guarantees regarding their accuracy or
+ validity.
+ 
+ We strongly recommend you conduct your own
+ performance benchmarks before deciding on a particular
+ hardware/software setup (and hopefully submit
+ these figures to us).
+ 

- </section>
+ <subsection name="Hamish Carpenter's benchmarks">
+ <ul>
+ 
+ Hardware Environment 
+ <li>Dedicated machine for indexing: yes</li>
+ <li>CPU: Intel x86 P4 1.5Ghz</li>
+ <li>RAM: 512 DDR</li>
+ <li>Drive configuration: IDE 7200rpm Raid-1</li>
+ 
+ 
+ Software environment 
+ <li>Java Version: 1.3.1 IBM JITC Enabled</li>
+ <li>Java VM: </li>
+ <li>OS Version: Debian Linux 2.4.18-686</li>
+ <li>Location of index: local</li>
+ 
+ 
+ Lucene indexing variables 
+ <li>Number of source documents: Random generator. Set
+ to make 1M documents
+ in 2x500,000 batches.</li>
+ <li>Total filesize of source documents: > 1GB if
+ stored</li>
+ <li>Average filesize of source documents: 1KB</li>
+ <li>Source documents storage location: Filesystem</li>
+ <li>File type of source documents: Generated</li>
+ <li>Parser(s) used, if any: </li>
+ <li>Analyzer(s) used: Default</li>
+ <li>Number of fields per document: 11</li>
+ <li>Type of fields: 1 date, 1 id, 9 text</li>
+ <li>Index persistence: FSDirectory</li>
+ 
+ 
+ Figures 
+ <li>Time taken (in ms/s as an average of at least 3
+ indexing runs): </li>
+ <li>Time taken / 1000 docs indexed: 49 seconds</li>
+ <li>Memory consumption:</li>
+ 
+ 
+ Notes 
+ <li>Notes:
+ 
+ A Windows client ran a random document generator which
+ created
+ documents based on some arrays of values and an excerpt
+ (approx. 1 KB)
+ from a text file of the Bible (King James version). 
+ These were submitted via a socket connection (open throughout
+ the indexing process). 
+ The index writer was not closed between index calls. 
+ This created a 400 MB index in 23 files (after
+ optimization). 
+ 
+ 
+ Query details: 
+ 
+ 
+ Set up a threaded class to start x number of simultaneous
+ threads to
+ search the above created index.
+ 
+ 
+ Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
+ (Teaser:goo* Teaser:plan*)
+ (Details:goo* Details:plan*)) -Cancel:y)
+ +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
+ +EndDate:[mq3dj1uq0-ntlxuggw0]
+ 
+ 
+ This query counted 34000 documents and I limited the returned
+ documents
+ to 5.
+ 
+ 
+ This is using Peter Halacsy's IndexSearcherCache, slightly
+ modified to
+ be a singleton that returns cached searchers for a given
+ directory. This
+ solved an initial problem with too many files open and
+ running out of
+ Linux file handles for them.
+ 
+ <pre>
+ Threads|Avg Time per query (ms)
+ 1 1009ms
+ 2 2043ms
+ 3 3087ms
+ 4 4045ms
+ .. .
+ .. .
+ 10 10091ms
+ </pre>
+ 
+ I removed the two date range terms from the query and it made
+ a HUGE
+ difference in performance. With 4 threads the avg time
+ dropped to 900ms!
+ 
+ Other query optimizations made little difference.</li>
+ 
+ </ul>
+ 
+ Hamish can be contacted at hamish at catalyst.net.nz.
+ 
+ </subsection>
+
+ <subsection name="Justin Greene's benchmarks">
+ <ul>
+ 
+ Hardware Environment 
+ <li>Dedicated machine for indexing: No, but nominal
+ usage at time of indexing.</li>
+ <li>CPU: Compaq Proliant 1850R/600 2 X pIII 600</li>
+ <li>RAM: 1GB, 256MB allocated to JVM.</li>
+ <li>Drive configuration: RAID 5 on Fibre Channel
+ Array</li>
+ 
+ 
+ Software environment 
+ <li>Java Version: 1.3.1_06</li>
+ <li>Java VM: </li>
+ <li>OS Version: Winnt 4/Sp6</li>
+ <li>Location of index: local</li>
+ 
+ 
+ Lucene indexing variables 
+ <li>Number of source documents: about 60K</li>
+ <li>Total filesize of source documents: 6.5GB</li>
+ <li>Average filesize of source documents: 100K
+ (6.5GB/60K documents)</li>
+ <li>Source documents storage location: filesystem on
+ NTFS</li>
+ <li>File type of source documents: </li>
+ <li>Parser(s) used, if any: Currently the only parser
+ used is the Quiotix html
+ parser.</li>
+ <li>Analyzer(s) used: SimpleAnalyzer</li>
+ <li>Number of fields per document: 8</li>
+ <li>Type of fields: All strings, and all are stored
+ and indexed.</li>
+ <li>Index persistence: FSDirectory</li>
+ 
+ 
+ Figures 
+ <li>Time taken (in ms/s as an average of at least 3
+ indexing runs): 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17
+ minutes. Note that the #
+ and size of documents changes daily.</li>
+ <li>Time taken / 1000 docs indexed: </li>
+ <li>Memory consumption: JVM is given 256MB and uses it
+ all.</li>
+ 
+ 
+ Notes 
+ <li>Notes:
+ 
+ We have 10 threads reading files from the filesystem,
+ parsing and
+ analyzing them, and then pushing them onto a queue, and a single
+ thread popping
+ them from the queue and indexing. Note that we are indexing
+ email messages
+ and are storing the entire plaintext of the message in the
+ index. If the
+ message contains an attachment and we do not have a filter for
+ the attachment
+ (i.e. we do not do PDFs yet), we discard the data.
+ </li>
+ 
+ </ul>
+ 
+ Justin can be contacted at tvxh-lw4x at spamex.com.
+ 
+ </subsection>
+
+
+ <subsection name="Daniel Armbrust's benchmarks">
+ 
+ My disclaimer is that this is a very poor "Benchmark". It was not done for raw speed,
+ nor was the total index built in one shot. The index was created on several different
+ machines (all with these specs, or very similar), with each machine indexing batches of 500,000 to
+ 1 million documents per batch. Each of these small indexes was then moved to a
+ much larger drive, where they were all merged together into a big index.
+ This process was done manually, over the course of several months, as the sources became available.
+ 
+ <ul>
+ 
+ Hardware Environment 
+ <li>Dedicated machine for indexing: no - The machine had moderate to low load. However, the indexing process was built single
+ threaded, so it only took advantage of 1 of the processors. It usually got 100% of this processor.</li>
+ <li>CPU: Sun Ultra 80 4 x 64 bit processors</li>
+ <li>RAM: 4 GB Memory</li>
+ <li>Drive configuration: Ultra-SCSI Wide 10000 RPM 36GB Drive</li>
+ 
+ 
+ Software environment 
+ <li>Java Version: 1.3.1</li>
+ <li>Java VM: </li>
+ <li>OS Version: Sun 5.8 (64 bit)</li>
+ <li>Location of index: local</li>
+ 
+ 
+ Lucene indexing variables 
+ <li>Number of source documents: 13,820,517</li>
+ <li>Total filesize of source documents: 87.3 GB</li>
+ <li>Average filesize of source documents: 6.3 KB</li>
+ <li>Source documents storage location: Filesystem</li>
+ <li>File type of source documents: XML</li>
+ <li>Parser(s) used, if any: </li>
+ <li>Analyzer(s) used: A home grown analyzer that simply removes stopwords.</li>
+ <li>Number of fields per document: 1 - 31</li>
+ <li>Type of fields: All text, though 2 of them are dates (20001205) that we filter on</li>
+ <li>Index persistence: FSDirectory</li>
+ <li>Index size: 12.5 GB</li>
+ 
+ 
+ Figures 
+ <li>Time taken (in ms/s as an average of at least 3
+ indexing runs): For 617271 documents, 209698 seconds (or ~2.5 days)</li>
+ <li>Time taken / 1000 docs indexed: 340 Seconds</li>
+ <li>Memory consumption: (java executed with) java -Xmx1000m -Xss8192k so
+ 1 GB of memory was allotted to the indexer</li>
+ 
+ 
+ Notes 
+ <li>Notes:
+ 
+ The source documents were XML. The "indexer" opened each document one at a time, ran an
+ XSL transformation on them, and then proceeded to index the stream. The indexer optimized
+ the index every 50,000 documents (on this run) though previously, we optimized every
+ 300,000 documents. The performance didn't change much either way. We did no other
+ tuning (RAM Directories, separate process to pretransform the source material, etc)
+ to make it index faster. When all of these individual indexes were built, they were
+ merged together into the main index. That process usually took about a day.
+ </li>
+ 
+ </ul>
+ 
+ Daniel can be contacted at Armbrust.Daniel at mayo.edu.
+ 
+ </subsection>
+
+ </section>

</body>
</document>
-
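
The per-1000-document figures in the diff above can be cross-checked with simple arithmetic. The sketch below is not part of the commit; the class and method names (IndexingRate, secondsPer1000Docs, projectedSeconds) are made up for illustration. It derives Daniel Armbrust's rate from his reported run (617271 documents in 209698 seconds, which comes out just under the 340 seconds he lists) and projects a total time from Hamish Carpenter's 49 seconds per 1000 documents, assuming the rate stays constant over the whole run.

```java
public class IndexingRate {

    /** Seconds spent per 1000 documents, given a total time and a document count. */
    public static double secondsPer1000Docs(double totalSeconds, long docs) {
        return totalSeconds / docs * 1000.0;
    }

    /** Projected total seconds for `docs` documents, assuming a constant per-1000 rate. */
    public static double projectedSeconds(double secondsPer1000, long docs) {
        return secondsPer1000 * docs / 1000.0;
    }

    public static void main(String[] args) {
        // Daniel Armbrust's run: 617271 documents in 209698 seconds.
        double danielRate = secondsPer1000Docs(209698, 617271);
        System.out.printf("Daniel: %.2f s per 1000 docs%n", danielRate);

        // Hamish Carpenter's rate: 49 s per 1000 docs, applied to 1M documents.
        double hamishTotal = projectedSeconds(49, 1_000_000);
        System.out.printf("Hamish: %.0f s for 1M docs (%.1f hours)%n",
                hamishTotal, hamishTotal / 3600.0);
    }
}
```

The projection is only as good as the constant-rate assumption; periodic optimization and index growth can make later batches slower than earlier ones.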

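
Justin Greene's notes describe several reader threads that parse and analyze files and push them onto a queue, with a single thread popping from the queue and indexing. A minimal sketch of that arrangement follows. It is an illustration only, not his code: it uses java.util.concurrent (which did not exist in the Java 1.3.1 he ran), the class name QueueIndexingSketch is hypothetical, "parsing" is stood in for by trim/lowercase, and "indexing" is stood in for by appending to a list rather than calling any Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueIndexingSketch {

    // Unique sentinel object telling the consumer that no more documents will arrive.
    private static final String END = new String("<end>");

    /** Parses inputs on readerThreads threads; one consumer thread "indexes" them. */
    public static List<String> run(List<String> rawMessages, int readerThreads) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);
        List<String> indexed = new ArrayList<>();

        // Single consumer: the only thread that touches the "index".
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String doc = queue.take();
                    if (doc == END) {
                        break; // identity check against the sentinel
                    }
                    indexed.add(doc); // placeholder for adding a document to an index
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producers: parse each message and hand the result to the queue.
        ExecutorService readers = Executors.newFixedThreadPool(readerThreads);
        for (String raw : rawMessages) {
            readers.submit(() -> {
                try {
                    queue.put(raw.trim().toLowerCase()); // placeholder for parse + analyze
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            readers.shutdown();
            readers.awaitTermination(1, TimeUnit.MINUTES);
            queue.put(END);   // all producers are done; stop the consumer
            consumer.join();  // join makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> out = run(List.of("  Hello ", "WORLD"), 10);
        System.out.println(out.size() + " documents indexed");
    }
}
```

Keeping a single consumer means the index writer is never shared between threads, while the bounded queue stops the readers from running far ahead of the indexer and exhausting memory.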