Dear all,
We are using Lucene to store 10 million bilingual sentence pairs for some
natural language processing. Each document contains a sentence,
its translation and a topical code. We want to select sentences containing
certain words and compute statistics over the topical codes in order to detect
translations which depend on the topic (like key => Taste (topic: input
devices), key => Schlüssel (topic: cryptography)).
While the search itself is carried out in a reasonably short time (about
500 to 800 ms), we have a performance problem when actually retrieving the
documents with code like:
    for (int i = nrhits - 1; i >= 0; i--) {
        Document hitDoc = hits.doc(i);
        String code = hitDoc.get("code");
        // ... statistics
    }
Even when restricting nrhits to 2000, we have to wait 10 to 20 seconds just
for the retrieval. Since the documents are so short, we would have expected
a quicker retrieval. By the way, the loop runs in reverse order in the hope of
accelerating the retrieval.
We are using Lucene 1.4.3 Java version on a Windows PC.
Would you recommend using the C version? I suppose it is stable and we
can reuse the database? Any other suggestions?
Thanks for your help!
Wolfgang