Mailing List Archive: Parallelising a query...

Parallelising a query...

Nov 28, 2001, 10:09 PM

Post #1 of 7 (1741 views)

Hi,

Let say I want to retrieve all relevant listings for a query (just
suppose)...

I have 4 million documents... I could:

Split these into 4 x 1 million document indexes and then send a
query to 4 Lucene processes ? At the end I would have to sort the
results by relevance.

Question for Doug or any other Search Engine guru -- would this
reduce the time to find these results by 75% ?

I know it is probably a hard question to answer (i.e. all the
documents that match, might just be in one process...) but I'm more
getting at the average length of the inverted indexes that have to be
joined being reduced by 75%, hence the join should take only 25% of
the time...

Any thoughts on this idiocy ? Reason why I ask ? Well, lets say I
can't fit a 4 million document RamDir index into 1GB heap space, but
I could if I split it up :) ?

Cheers,
Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Parallelising a query... [ In reply to ]

DCutting at grandcentral

Nov 29, 2001, 9:52 AM

Post #2 of 7 (1712 views)

Permalink

> From: Winton Davies [mailto:wdavies@overture.com]
>
> I have 4 million documents... I could:
>
> Split these into 4 x 1 million document indexes and then send a
> query to 4 Lucene processes ? At the end I would have to sort the
> results by relevance.
>
> Question for Doug or any other Search Engine guru -- would this
> reduce the time to find these results by 75% ?

It could, if you have four processors and four disk drives and things work
out optimally.

If you have a single machine with multiple processors and/or a disk array,
and your CPU or i/o are not already maxed out, then multi-threading is a
good way to make searches faster. To implement this I would write something
like MultiSearcher, but that runs each sub-search in a separate thread, a
ThreadedMultiSearcher.

If you instead have several machines that you would like to spread search
load over, then you could use RMI to send queries to these machines. I
would first implement the single-machine version, ThreadedMultiSearcher,
then implement a RemoteSearcher class, that forwards Searcher methods via
RMI to a Searcher object on another machine. Then to spread load across
machines, construct a ThreadedMultiSearcher and populate it with
RemoteSearcher instances pointing at the different machines.

The Searcher API was designed with this sort of thing in mind. Note though
that HitCollector-based searching is not a good candidate for RMI, since it
does a callback for every document. Stick to the TopDocs-based search
method. You'll also need to forward docFreq(Term) and maxDoc(), used to
weight the query before searching, and doc(int), used to fetch hit
documents. Probably these should be abstracted into a separate interface,
Searchable.

Doug

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Parallelising a query... [ In reply to ]

wdavies at overture

Nov 29, 2001, 12:04 PM

Post #3 of 7 (1719 views)

Permalink

Hi Doug,

Thank you again for your wisdom. Yep, we'll probably be running on a
quad E420 or two.

I think that I'll get better performance with one virtual searcher,
spread over 4 CPU's (and < 1 GB ram each) rather than 4 monolithics
searchers each with its own index (actually I don't think I could get
4 on the machine --- the index is nearly a Gig each).

Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: Parallelising a query... [ In reply to ]

wdavies at overture

Nov 29, 2001, 12:12 PM

Post #4 of 7 (1722 views)

Permalink

Hi again....

Another dumb question :) (actually I'm too busy to look at the code :) )

In the index, is the datastructure of termDocs (is that the right
term), sorted by anything? Or is it just insertion order ? I could
see how one might want to sort by the Doc with the highest term
frequency ? But I can also see why
it might not help.

e.g. Token1 -> doc1 (2) [occurences] -> doc2 (6) -> doc3 (3)

or is it like this ?

Token1 -> doc2 (6) -> doc3 (3) -> doc1 (2) ?

I have an idea for an optimization I want to make, but I'm not sure
exactly whether it is warrants investigation.

Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Parallelising a query... [ In reply to ]

DCutting at grandcentral

Nov 29, 2001, 2:39 PM

Post #5 of 7 (1715 views)

Permalink

TermDocs are ordered by document number. It would not be easy to change
this.

Doug

> -----Original Message-----
> From: Winton Davies [mailto:wdavies@overture.com]
> Sent: Thursday, November 29, 2001 11:12 AM
> To: Lucene Users List
> Subject: Re: Parallelising a query...
>
>
> Hi again....
>
> Another dumb question :) (actually I'm too busy to look at
> the code :) )
>
> In the index, is the datastructure of termDocs (is that the right
> term), sorted by anything? Or is it just insertion order ? I could
> see how one might want to sort by the Doc with the highest term
> frequency ? But I can also see why
> it might not help.
>
> e.g. Token1 -> doc1 (2) [occurences] -> doc2 (6) -> doc3 (3)
>
> or is it like this ?
>
> Token1 -> doc2 (6) -> doc3 (3) -> doc1 (2) ?
>
>
> I have an idea for an optimization I want to make, but I'm not sure
> exactly whether it is warrants investigation.
>
> Winton
>
>
> Winton Davies
> Lead Engineer, Overture (NSDQ: OVER)
> 1820 Gateway Drive, Suite 360
> San Mateo, CA 94404
> work: (650) 403-2259
> cell: (650) 867-1598
> http://www.overture.com/
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Parallelising a query... [ In reply to ]

wdavies at overture

Nov 29, 2001, 3:06 PM

Post #6 of 7 (1716 views)

Permalink

Thanks ! (one thing to cross off my list of optimizations...)

Cheers,
Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Parallelising a query... [ In reply to ]

wdavies at overture

Nov 29, 2001, 3:08 PM

Post #7 of 7 (1713 views)

Permalink

Whoops, must think before I write...

I think this is *good* for me.

Let's say I have a default sort property I want to have:

If I only do a one-time Index, if I sort the input documents in the
order I want before indexing, then the results will come out in that
sort order -- assuming I implement my own hit collector (basically
just ignore the score...)

Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>