Hi all,
Thanks for all your continuing help! I have got the go-ahead to
build a production-level prototype of my project. I have to be able
to serve several hundred queries a second (on big boxes), and I'm
currently getting 2 or 3 seconds per query with a sloppy phrase match.
When I profiled my user code, I saw that it is the uniquification
loop that is killing me.
In my application, I have to return a list of documents that has been
uniquified by accountID: the most relevant document for each accountID
is returned, and subsequent hits with the same accountID are dropped.
In a recent search of an 8-million-document index, I got roughly 200
sloppy phrase hits and needed to weed out the duplicates.
The pseudo-code is:

    int i = 0;
    while (i < hits.length() && resultSet.size() < 40) {
        String accountID = hits.doc(i).get("accountID");
        if (hashtable.get(accountID) == null) {   // first hit for this account
            hashtable.put(accountID, accountID);
            resultSet.add(hits.doc(i));
        }
        i++;   // must increment even on a duplicate, or the loop never ends
    }
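Restated as self-contained Java (no Lucene; the hits are faked here as
(docID, accountID) pairs already sorted by relevance, and the class and
method names are mine, not from the code above), the uniquification step is:

```java
import java.util.*;

// Keep only the most relevant hit per accountID, capped at 'max' results.
// Hits are assumed to arrive in descending-relevance order.
public class Uniquify {
    static List<String> firstPerAccount(List<String[]> hits, int max) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String[] hit : hits) {
            if (kept.size() >= max) break;
            String docID = hit[0], accountID = hit[1];
            if (seen.add(accountID)) {  // add() returns false on a duplicate
                kept.add(docID);
            }
        }
        return kept;
    }
}
```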
I timed it at about 60 ms each time round that loop, which makes me
suspect the hits.doc(i).get() call. That seems really inefficient (the
query is a sloppy phrase match). Any ideas how I can speed this up? I'm
obviously going to try a RAMDirectory version, but a 60 ms delay per
document still seems over the top.
I guess the short version of this is:
(a) Is there a way to do this uniquification in the index itself?
(b) Or is there a special kind of field that is ultra-fast to access given "i"?
(c) Or any way to speed up the existing behaviour?
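For what it's worth, option (b) in spirit could look like the sketch below:
pay the field-lookup cost once per document at startup, so the per-hit
lookup inside the search loop is just an array read. The `loader` argument
is a stand-in for however the field would actually be read from the index;
none of these names are real Lucene APIs.

```java
import java.util.function.IntFunction;

// Hypothetical sketch: precompute accountID for every document number once,
// so per-hit access is an O(1) array read instead of loading the stored
// document each time.
public class AccountCache {
    private final String[] accountByDoc;

    AccountCache(int numDocs, IntFunction<String> loader) {
        accountByDoc = new String[numDocs];
        for (int d = 0; d < numDocs; d++) {
            accountByDoc[d] = loader.apply(d);  // one-time cost at startup
        }
    }

    // Cheap lookup by document number inside the search loop.
    String accountFor(int doc) {
        return accountByDoc[doc];
    }
}
```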
Cheers,
Winton
Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>