Mailing List Archive

Efficient doc information retrieval.
Hi all,

Thanks for all your continuing help! I have got the go-ahead to
build a production-level prototype of my project. I have to be able
to serve several hundred queries a second (on big boxes), and I'm
currently getting 2 or 3 seconds/query with a sloppy phrase match. I
profiled my user code, and I saw that it is the uniqification loop
that is killing me.

In my application, I have to be able to return a list of documents
that have been uniqified according to an accountID. The most relevant
document for an accountID is returned, and then subsequent hits that
have the same accountID are dropped.

So, in a recent search of an 8 million document index, I got around
200? sloppy phrase hits, and I needed to weed out the duplicates.

So the pseudocode is:

i = 0;
while ( i < hits.length && resultSet.size < 40) {
    accountID = doc(i).get("accountID");
    if (hashtable.get(accountID) == null) {
        // first (most relevant) hit for this accountID
        insert accountID in hashtable, add doc(i) to resultSet;
    }
    i++;    // always advance, whether the hit was kept or dropped
}

I timed it, and I was getting about 60 msecs each time round that
loop, which makes me suspect the doc(i).get().

This seems to be really inefficient (the query is a sloppy phrase
match). Any ideas how I can speed this up? I'm obviously going to
try a RAMDirectory version, but it seems that the 60 msec delay is
over the top?

I guess the short version of this is:

(a) Is there a way to do this uniqification somehow in the index itself?
(b) Or is there a special kind of field which is ultrafast to access given "i"?
(c) Or is there any way to speed up the existing behaviour?

Cheers,
Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Efficient doc information retrieval.
Winton Davies wrote:
>
> Hi all,

> In my application, I have to be able to return a list of documents
> that have been uniqified according to an accountID. The most relevant
> document for an accountID is returned, and then subsequent hits that
> have the same accountID are dropped.

Do you mean that certain documents are associated with particular
account IDs? If so, why not include the account ID as part of the query?
Or have I missed something?

Cheers,

Eliot Kimber
ISOGEN International
eliot@isogen.com

Re: Efficient doc information retrieval.
Hi Eliot,

Not really, all documents have an accountID, but I need to search
all the documents first, and each document that is returned has an
accountID, but I just want one document per accountID.

so:

doc1 acc1
doc2 acc1
doc3 acc1
doc4 acc2
doc5 acc2
doc6 acc2

Let's say the query "X" returns hits in this order:

doc1
doc2
doc3
doc4
doc5

what I want returned is:

doc1 (best of acc1)
doc4 (best of acc2)

Note that creating a separate index for each account is impractical
(30K+ accountIDs).
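
As a small illustration of the result described above, here is a minimal
Java sketch that keeps only the first (best-ranked) document per account
from an already ranked list. The class name FirstPerAccount, the method
firstPerAccount, and the {docName, accountID} string pairs are all
hypothetical and stand in for real hits; this is not Lucene code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FirstPerAccount {
    // ranked[i] = {docName, accountID}, already sorted best hit first.
    // Both values are illustrative strings, not real Lucene objects.
    public static List<String> firstPerAccount(String[][] ranked, int max) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < ranked.length && result.size() < max; i++) {
            // Set.add returns true only the first time an accountID is added
            if (seen.add(ranked[i][1])) {
                result.add(ranked[i][0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] hits = {
            {"doc1", "acc1"}, {"doc2", "acc1"}, {"doc3", "acc1"},
            {"doc4", "acc2"}, {"doc5", "acc2"}
        };
        System.out.println(firstPerAccount(hits, 40)); // prints [doc1, doc4]
    }
}
```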

Cheers,
Winton



At 17:30 -0600 11/14/01, W. Eliot Kimber wrote:
>Winton Davies wrote:
>>
>> Hi all,
>
>> In my application, I have to be able to return a list of documents
>> that have been uniqified according to an accountID. The most relevant
>> document for an accountID is returned, and then subsequent hits that
>> have the same accountID are dropped.
>
>Do you mean that certain documents are associated with particular
>account IDs? If so, why not include the account ID as part of the query?
>Or have I missed something?
>
>Cheers,
>
>Eliot Kimber
>ISOGEN International
>eliot@isogen.com
>
Re: Efficient doc information retrieval.
Winton Davies wrote:
>
> Hi Eliot,
>
> Not really, all documents have an accountID, but I need to search
> all the documents first, and each document that is returned has an
> accountID, but I just want one document per accountID.

I see the problem. Can't think of any other way to solve it than to
post-process the returned docs as you described.

Eliot Kimber
ISOGEN International
eliot@isogen.com

Re: Efficient doc information retrieval.
Thanks anyway! I much appreciate you thinking about it.

Cheers,
Winton

RE: Efficient doc information retrieval.
> From: Winton Davies [mailto:wdavies@overture.com]
>
> Not really, all documents have an accountID, but I need to search
> all the documents first, and each document that is returned has an
> accountID, but I just want one document per accountID.
>
> so:
>
> doc1 acc1
> doc2 acc1
> doc3 acc1
> doc4 acc2
> doc5 acc2
> doc6 acc2
>
> Let's say the query "X" returns hits in this order:
>
> doc1
> doc2
> doc3
> doc4
> doc5
>
> what I want returned is:
>
> doc1 (best of acc1)
> doc4 (best of acc2)

You might try something like:
  - construct an int[] array that has an "account number" for each document
  - construct a BitSet to keep track of whether you've seen a hit for each
    account
  - use a HitCollector that uses these as follows:
      if (!seen.get(accounts[doc])) {
        seen.set(accounts[doc]);
        collect(doc);
      }
That should be quite fast.
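
The steps above might be sketched in plain Java roughly as follows. This
is only an illustration under stated assumptions: the class
UniqueAccountCollector and its collect(int) and kept() methods are
hypothetical stand-ins for a real HitCollector subclass, and account
numbers are assumed to be dense ints in the range 0..numAccounts-1.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class UniqueAccountCollector {
    private final int[] accounts;  // accounts[doc] = account number of that doc
    private final BitSet seen;     // one bit per account number
    private final List<Integer> kept = new ArrayList<>();

    public UniqueAccountCollector(int[] accounts, int numAccounts) {
        this.accounts = accounts;
        this.seen = new BitSet(numAccounts);
    }

    // Hypothetical callback: invoked once for each matching document number.
    public void collect(int doc) {
        int account = accounts[doc];
        if (!seen.get(account)) {
            seen.set(account);   // later hits for this account are dropped
            kept.add(doc);
        }
    }

    public List<Integer> kept() {
        return kept;
    }

    public static void main(String[] args) {
        // Docs 0-2 belong to account 0, docs 3-5 to account 1.
        int[] accounts = {0, 0, 0, 1, 1, 1};
        UniqueAccountCollector c = new UniqueAccountCollector(accounts, 2);
        for (int doc = 0; doc < 5; doc++) {
            c.collect(doc);
        }
        System.out.println(c.kept()); // prints [0, 3]
    }
}
```

Note that this keeps the first document seen for each account. If the
collector is called in index order rather than in score order, one would
instead have to remember the best-scoring document per account (for
example in a float[] of best scores indexed by account number).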

You'll want to construct the accounts array once when you open the index,
perhaps by enumerating some terms that store this information. You'll need
to construct a new "seen" BitSet for each query, or clear a cached one.
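
One possible way to prepare those two structures is sketched below. It
assumes the account ID of every document has already been read into a
String[] (how that is done against the index is left open here), and
the names AccountTable and freshSeen are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class AccountTable {
    public final int[] accounts;   // accounts[doc] = dense account number
    public final int numAccounts;  // number of distinct account IDs
    private final BitSet seen;     // cached, cleared before each query

    // accountIds[doc] is the account ID string of document number doc.
    public AccountTable(String[] accountIds) {
        Map<String, Integer> numbers = new HashMap<>();
        accounts = new int[accountIds.length];
        for (int doc = 0; doc < accountIds.length; doc++) {
            Integer n = numbers.get(accountIds[doc]);
            if (n == null) {
                n = numbers.size();            // next unused account number
                numbers.put(accountIds[doc], n);
            }
            accounts[doc] = n;
        }
        numAccounts = numbers.size();
        seen = new BitSet(numAccounts);
    }

    // Returns the cached BitSet with every bit cleared, ready for a new query.
    public BitSet freshSeen() {
        seen.clear();
        return seen;
    }
}
```

Built once when the index is opened, the table costs one int per
document. A single cached BitSet is only safe if queries run one at a
time; with concurrent queries, a BitSet per thread or per query would
be needed.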

Doug

RE: Efficient doc information retrieval.
Thanks so much Doug! Looks clever, I'll try it out :)

Winton
