Mailing List Archive

Performance problem
Dear all,

we are using Lucene to store 10 million bilingual sentence pairs for doing some
natural language processing with them. Each document contains a sentence,
its translation and a topical code. We want to select sentences containing
certain words and do statistics over the topical codes in order to detect
translations which depend on the topic (e.g. key => Taste (topic: input
devices), key => Schlüssel (topic: cryptography)).

While the search is carried out in a reasonably short time (about
500..800 ms), we have a performance problem with actually retrieving the
documents, using code like:

for (int i = nrhits - 1; i >= 0; i--) {
    Document hitDoc = hits.doc(i);
    String code = hitDoc.get("code");
    // ... statistics
}

Even when restricting nrhits to 2000, we have to wait 10..20 seconds just
for the retrieval. Since the documents are so short, we would have expected
a quicker retrieval. By the way, the loop was done in reverse order in the
hope of accelerating the retrieval.

We are using Lucene 1.4.3 Java version on a Windows PC.

Would you recommend using the C version? I suppose it is stable and that we
can reuse the existing index? Any other suggestions?

Thanks for your help !

Wolfgang
Re: Performance problem
On Aug 24, 2005, at 3:32 AM, Wolfgang Täger wrote:

> Dear all,
>
> we are using Lucene to store 10 million bilingual sentence pairs for doing
> some natural language processing with them. Each document contains a
> sentence, its translation and a topical code. We want to select sentences
> containing certain words and do statistics over the topical codes in order
> to detect translations which depend on the topic (e.g. key => Taste
> (topic: input devices), key => Schlüssel (topic: cryptography)).
>
> While the search is carried out in a reasonably short time (about
> 500..800 ms), we have a performance problem with actually retrieving the
> documents, using code like:
>
>     for (int i = nrhits - 1; i >= 0; i--) {
>         Document hitDoc = hits.doc(i);
>         String code = hitDoc.get("code");
>         // ... statistics
>     }
>
> Even when restricting nrhits to 2000, we have to wait 10..20 seconds just
> for the retrieval. Since the documents are so short, we would have expected
> a quicker retrieval. By the way, the loop was done in reverse order in the
> hope of accelerating the retrieval.

How many documents are you trying to retrieve? I think you'll have
much better luck if you walk the documents in ascending Hits order
rather than backwards, as Hits caches documents on the presumption
that you'll move forward through them. I'd be curious to see how much
(or whether) moving forward through Hits helps.

Erik
Re: Performance problem
Hello Erik,

I tried i++ and the performance is similar.

Maybe the problem is linked to the sorting of the results, because the
required time increases with the number of hits:

Query: DE:Taste     => Hits  2k    => retrieve first 2000:  3.9 sec
Query: needle       => Hits  9.5k  => retrieve first 2000: 15 sec
Query: connection   => Hits 78k    => retrieve first 2000: 10.1 sec
Query: product      => Hits 81k    => retrieve first 2000: 18.3 sec

Wolfgang
Re: Performance problem
On Wednesday 24 August 2005 09:32, Wolfgang Täger wrote:
> Dear all,
>
> we are using Lucene to store 10 million bilingual sentence pairs for doing
> some natural language processing with them. Each document contains a
> sentence, its translation and a topical code. We want to select sentences
> containing certain words and do statistics over the topical codes in order
> to detect translations which depend on the topic (e.g. key => Taste
> (topic: input devices), key => Schlüssel (topic: cryptography)).
>
> While the search is carried out in a reasonably short time (about
> 500..800 ms), we have a performance problem with actually retrieving the
> documents, using code like:
>
>     for (int i = nrhits - 1; i >= 0; i--) {
>         Document hitDoc = hits.doc(i);
>         String code = hitDoc.get("code");
>         // ... statistics
>     }
>
> Even when restricting nrhits to 2000, we have to wait 10..20 seconds just
> for the retrieval. Since the documents are so short, we would have expected
> a quicker retrieval. By the way, the loop was done in reverse order in the
> hope of accelerating the retrieval.
>
> We are using Lucene 1.4.3 Java version on a Windows PC.
>
> Would you recommend using the C version? I suppose it is stable and that we
> can reuse the existing index? Any other suggestions?

For this much retrieval, it's better to roll your own:
Use the low-level search API Searcher.search(Query, HitCollector) to collect
all the hits by document number, keeping the scores if you need them.
Then sort these doc numbers (they are normally not far from sorted after
collecting), and retrieve all docs in that sorted order with
IndexReader.document(int).
That way, with a bit of luck, the disk head never needs to change direction
during retrieval, and prefetches by the operating system (if any) stand a
much better chance of actually being used.
In case you don't have the index reader around, open it explicitly
and construct your searcher from it.
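A rough sketch of this approach against the Lucene 1.4 API (variable names
such as indexReader, searcher, query and the "code" field follow the earlier
posts; collecting into a plain java.util list of Integers is only one way to
keep the doc numbers):

final List docNrs = new ArrayList();
searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        // keep the score here as well if the statistics need it
        docNrs.add(new Integer(doc));
    }
});
Collections.sort(docNrs);  // ascending doc numbers: read the index mostly sequentially
for (Iterator it = docNrs.iterator(); it.hasNext();) {
    Document d = indexReader.document(((Integer) it.next()).intValue());
    String code = d.get("code");
    // ... statistics
}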

Regards,
Paul Elschot
Re: Performance problem
Wolfgang Täger wrote:
> While the search is carried out in a reasonably short time (about
> 500..800 ms), we have a performance problem with actually retrieving the
> documents, using code like:
>
>     for (int i = nrhits - 1; i >= 0; i--) {
>         Document hitDoc = hits.doc(i);
>         String code = hitDoc.get("code");
>         // ... statistics
>     }

If you have enough RAM, a FieldCache would make this very fast.

TopDocs hits = searcher.search(query, (Filter)null, 2000);
String[] codes = FieldCache.DEFAULT.getStrings(indexReader, "code");
for (int i = 0; i < hits.scoreDocs.length; i++) {
    String code = codes[hits.scoreDocs[i].doc];
    // ...
}

Doug
Re: Performance problem, Search within search for TopDocs?
Hello Paul, hello Doug,

Many thanks for your help!

@Paul: I didn't try your method, because I feared that retrieving all hits
would not solve my problem: the number of hits may be many times higher than
the 2000 I want for my statistics.

@Doug: I tried your method, which is extremely fast!

The first run is still very slow (many seconds, but that is not a problem
for me); measuring the following runs sometimes gives 0 ms using
currentTimeMillis()!


Now I have one more question: I want to use the TopDocs result

TopDocs hits = searcher.search(query, (Filter)null, 2000);

as a filter for subsequent queries like

DE:Schlüssel, limited to the 2000 TopDocs of the "key" query.

AFAIK, the recommended way to search within a search is to combine the
previous query with the current query using a BooleanQuery, in which the
previous query is marked as required.
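A sketch of that recommendation against the Lucene 1.4 API (keyQuery and
deQuery are placeholder names for the earlier "key" query and the
DE:Schlüssel query, not names from the thread):

BooleanQuery combined = new BooleanQuery();
combined.add(keyQuery, true, false);  // previous query: required, not prohibited
combined.add(deQuery, true, false);   // current query: required, not prohibited
Hits hits2 = searcher.search(combined);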

Can this be done in this case? My query does not carry the information that I
want at most 2000 results.
Note: I do not want to limit the combined query to 2000 results; rather, I
typically expect fewer results by first restricting the "key" query to its
top 2000 hits and then looking within those results for DE:"Schlüssel".

Wolfgang
Re: Performance problem, Search within search for TopDocs?
On Thursday 25 August 2005 09:14, Wolfgang Täger wrote:
> Hello Paul, hello Doug,
>
> Many thanks for your help!
>
> @Paul: I didn't try your method, because I feared that retrieving all hits
> would not solve my problem: the number of hits may be many times higher
> than the 2000 I want for my statistics.
>
> @Doug: I tried your method, which is extremely fast!
>
> The first run is still very slow (many seconds, but that is not a problem
> for me); measuring the following runs sometimes gives 0 ms using
> currentTimeMillis()!

The method I gave is meant to optimize that first read from disk;
I'd expect FieldCache to use it.

>
>
> Now I have one more question: I want to use the TopDocs result
>
> TopDocs hits = searcher.search(query, (Filter)null, 2000);
>
> as a filter for subsequent queries like
>
> DE:Schlüssel, limited to the 2000 TopDocs of the "key" query.
>
> AFAIK, the recommended way to search within a search is to combine the
> previous query with the current query using a BooleanQuery, in which the
> previous query is marked as required.
>
> Can this be done in this case? My query does not carry the information
> that I want at most 2000 results.
> Note: I do not want to limit the combined query to 2000 results; rather, I
> typically expect fewer results by first restricting the "key" query to its
> top 2000 hits and then looking within those results for DE:"Schlüssel".

Have a look at QueryFilter and FilteredQuery; they fit nicely here.
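A minimal sketch of how the two classes could be combined here (Lucene 1.4
API; keyQuery stands for the earlier "key" query and the field name "DE"
follows the thread, but note that the filter is built from the whole "key"
query rather than from its top 2000 hits, which is what the follow-up below
is about):

Query deQuery = new TermQuery(new Term("DE", "Schlüssel"));
Filter keyFilter = new QueryFilter(keyQuery);   // caches the doc set matching keyQuery
Hits filteredHits = searcher.search(new FilteredQuery(deQuery, keyFilter));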

Regards,
Paul Elschot
Re: Performance problem, Search within search for TopDocs?
Paul,

the point with QueryFilter and FilteredQuery is that they expect a Query
and not a TopDocs.

Wolfgang
Re: Performance problem
Hello again,

Since my code field does not contain strings, I have a little problem using
getStrings.

I probably have to use

    FieldCache.StringIndex codes =
        FieldCache.DEFAULT.getStringIndex(indexReader, "code");

However, I do not understand how to find the code of hit i in the results.

I tried something like

    codes.lookup[codes.order[hits.scoreDocs[i].doc]]

Is this correct?
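(Spelled out as a loop, and assuming the Lucene 1.4 StringIndex layout in
which order[doc] holds a document's term ordinal and lookup maps ordinals
back to term strings, the retrieval being described would look roughly like:)

FieldCache.StringIndex codes =
    FieldCache.DEFAULT.getStringIndex(indexReader, "code");
for (int i = 0; i < hits.scoreDocs.length; i++) {
    // order[doc] -> ordinal of the document's term; lookup[ordinal] -> the string
    String code = codes.lookup[codes.order[hits.scoreDocs[i].doc]];
    // ... statistics
}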

Wolfgang


