Mailing List Archive

Parallelising a query...
Hi,

Let say I want to retrieve all relevant listings for a query (just
suppose)...

I have 4 million documents... I could:

Split these into 4 x 1 million document indexes and then send a
query to 4 Lucene processes ? At the end I would have to sort the
results by relevance.

Question for Doug or any other Search Engine guru -- would this
reduce the time to find these results by 75% ?

I know it is probably a hard question to answer (i.e. all the
documents that match, might just be in one process...) but I'm more
getting at the average length of the inverted indexes that have to be
joined being reduced by 75%, hence the join should take only 25% of
the time...

Any thoughts on this idiocy ? Reason why I ask ? Well, lets say I
can't fit a 4 million document RamDir index into 1GB heap space, but
I could if I split it up :) ?

Cheers,
Winton






Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: Parallelising a query... [ In reply to ]
> From: Winton Davies [mailto:wdavies@overture.com]
>
> I have 4 million documents... I could:
>
> Split these into 4 x 1 million document indexes and then send a
> query to 4 Lucene processes ? At the end I would have to sort the
> results by relevance.
>
> Question for Doug or any other Search Engine guru -- would this
> reduce the time to find these results by 75% ?

It could, if you have four processors and four disk drives and things work
out optimally.

If you have a single machine with multiple processors and/or a disk array,
and your CPU or i/o are not already maxed out, then multi-threading is a
good way to make searches faster. To implement this I would write something
like MultiSearcher, but that runs each sub-search in a separate thread, a
ThreadedMultiSearcher.

If you instead have several machines that you would like to spread search
load over, then you could use RMI to send queries to these machines. I
would first implement the single-machine version, ThreadedMultiSearcher,
then implement a RemoteSearcher class, that forwards Searcher methods via
RMI to a Searcher object on another machine. Then to spread load across
machines, construct a ThreadedMultiSearcher and populate it with
RemoteSearcher instances pointing at the different machines.

The Searcher API was designed with this sort of thing in mind. Note though
that HitCollector-based searching is not a good candidate for RMI, since it
does a callback for every document. Stick to the TopDocs-based search
method. You'll also need to forward docFreq(Term) and maxDoc(), used to
weight the query before searching, and doc(int), used to fetch hit
documents. Probably these should be abstracted into a separate interface,
Searchable.

Doug

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: Parallelising a query... [ In reply to ]
Hi Doug,

Thank you again for your wisdom. Yep, we'll probably be running on a
quad E420 or two.

I think that I'll get better performance with one virtual searcher,
spread over 4 CPU's (and < 1 GB ram each) rather than 4 monolithics
searchers each with its own index (actually I don't think I could get
4 on the machine --- the index is nearly a Gig each).

Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Parallelising a query... [ In reply to ]
Hi again....

Another dumb question :) (actually I'm too busy to look at the code :) )

In the index, is the datastructure of termDocs (is that the right
term), sorted by anything? Or is it just insertion order ? I could
see how one might want to sort by the Doc with the highest term
frequency ? But I can also see why
it might not help.

e.g. Token1 -> doc1 (2) [occurences] -> doc2 (6) -> doc3 (3)

or is it like this ?

Token1 -> doc2 (6) -> doc3 (3) -> doc1 (2) ?


I have an idea for an optimization I want to make, but I'm not sure
exactly whether it is warrants investigation.

Winton


Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: Parallelising a query... [ In reply to ]
TermDocs are ordered by document number. It would not be easy to change
this.

Doug

> -----Original Message-----
> From: Winton Davies [mailto:wdavies@overture.com]
> Sent: Thursday, November 29, 2001 11:12 AM
> To: Lucene Users List
> Subject: Re: Parallelising a query...
>
>
> Hi again....
>
> Another dumb question :) (actually I'm too busy to look at
> the code :) )
>
> In the index, is the datastructure of termDocs (is that the right
> term), sorted by anything? Or is it just insertion order ? I could
> see how one might want to sort by the Doc with the highest term
> frequency ? But I can also see why
> it might not help.
>
> e.g. Token1 -> doc1 (2) [occurences] -> doc2 (6) -> doc3 (3)
>
> or is it like this ?
>
> Token1 -> doc2 (6) -> doc3 (3) -> doc1 (2) ?
>
>
> I have an idea for an optimization I want to make, but I'm not sure
> exactly whether it is warrants investigation.
>
> Winton
>
>
> Winton Davies
> Lead Engineer, Overture (NSDQ: OVER)
> 1820 Gateway Drive, Suite 360
> San Mateo, CA 94404
> work: (650) 403-2259
> cell: (650) 867-1598
> http://www.overture.com/
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: Parallelising a query... [ In reply to ]
Thanks ! (one thing to cross off my list of optimizations...)

Cheers,
Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: Parallelising a query... [ In reply to ]
Whoops, must think before I write...

I think this is *good* for me.

Let's say I have a default sort property I want to have:

If I only do a one-time Index, if I sort the input documents in the
order I want before indexing, then the results will come out in that
sort order -- assuming I implement my own hit collector (basically
just ignore the score...)

Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>