Some performance numbers
Java Version: 1.3_01
OS Version: Windows 2000
CPU (Type, Speed and Quantity): Pentium 4, 1.5 GHz, 1 CPU
RAM: 512 MB
Drive configuration (IDE, SCSI, RAID-1, RAID-5): IDE (single)
Number of source documents: 103009
Total filesize of source documents: 430MB
Average filesize of source documents (in KB/MB): 4.3KB
Source documents storage location (filesystem, DB, http,etc): Filesystem
File type of source documents: xml
Parser(s) used, if any: Standard Analyzer
Number of Fields per document: 8
Time taken (in ms/s as an average of at least 3 indexing runs): 8387 sec (139 min)
Time taken / 1000 docs indexed: 81 sec / 1000 docs
Notes (any special tuning/strategies):
I convert each document to a DOM and use XPath to extract the fields.
I also validate the data, checking that it meets certain criteria
(such as a total size > 150 characters) and using a HashMap to verify
that there are no duplicates. Without these checks, indexing goes faster
(about 60 seconds/1000 docs).
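
For readers who want to picture the per-document work described in these notes, here is a minimal sketch, not the actual code used for the numbers above. It assumes a modern JDK (the javax.xml.xpath API is not part of Java 1.3), and the class name XmlFieldExtractor, the element paths /doc/id, /doc/title and /doc/body, and the choice of the id field as the duplicate key are all hypothetical. Adding the extracted fields to the Lucene index is left out.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parses one XML document, pulls fields out with XPath,
// and rejects documents that are too small or already seen.
public class XmlFieldExtractor {
    // Threshold taken from the notes: total size must exceed 150 characters.
    static final int MIN_TOTAL_CHARS = 150;

    private final DocumentBuilder builder;
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    // HashMap used as a "seen" set for duplicate detection (key choice is an assumption).
    private final Map<String, Boolean> seenIds = new HashMap<>();

    public XmlFieldExtractor() throws Exception {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    /** Returns the extracted fields, or null if the document is rejected. */
    public Map<String, String> extract(String xml) throws Exception {
        // Step 1: convert the document to a DOM.
        Document dom = builder.parse(new InputSource(new StringReader(xml)));

        // Step 2: use XPath to get the fields (paths are illustrative only).
        Map<String, String> fields = new HashMap<>();
        fields.put("id", xpath.evaluate("/doc/id", dom));
        fields.put("title", xpath.evaluate("/doc/title", dom));
        fields.put("body", xpath.evaluate("/doc/body", dom));

        // Step 3: validate the total size of the extracted text.
        int total = 0;
        for (String value : fields.values()) {
            total += value.length();
        }
        if (total <= MIN_TOTAL_CHARS) {
            return null;
        }

        // Step 4: reject duplicates; put() returns the previous value if the key existed.
        if (seenIds.put(fields.get("id"), Boolean.TRUE) != null) {
            return null;
        }
        return fields;
    }
}
```

Building a full DOM and evaluating several XPath expressions per document is comparatively expensive, which fits the gap reported above between about 81 and about 60 seconds per 1000 docs once the checks are skipped.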
I hope this is helpful.
--Peter
--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>