Mailing List Archive

Notes on distributed searching with Lucene
I have written up some of my experiences with creating a distributed system
with Lucene here:

http://home.clara.net/markharwood/lucene/

It includes some UML interaction diagrams that I found useful in understanding
the Lucene codebase.

Cheers
Mark


--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
RE: Notes on distributed searching with Lucene
> From: Mark Harwood [mailto:markharwood@totalise.co.uk]
>
> I have written up some of my experiences with creating a
> distributed system
> with Lucene here:
>
> http://home.clara.net/markharwood/lucene/
>
> It includes some UML interaction diagrams that I found useful
> in understanding
> the Lucene codebase.

Mark,

It's great to see someone experimenting with this. I originally had
distributed searching in mind when I wrote Lucene, but never quite got to
adding it. A message that mentions some of these intentions is at:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00252.html

A less "chatty" interface than the one mentioned there might be:

public interface Searchable {
    public class TermStatistics implements Serializable {
        public int[] docFreqs;
        public int maxDoc;
    }
    TermStatistics getTermStatistics(Term[] terms) throws IOException;
    TopDocs search(Query query, Filter filter, int n) throws IOException;
    Document[] getDocs(int[] i) throws IOException;
}

With these three phases (collect term statistics, get doc id scores, get
docs) the results should be identical to searching the indexes locally with
MultiSearcher. It sounded like your experiments skipped the first phase.
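The first two phases of that merge can be sketched roughly as follows; the `ScoreDoc` pair and the merge helpers here are simplified stand-ins for illustration, not the actual Lucene classes:

```java
import java.util.ArrayList;
import java.util.List;

public class DistributedMerge {

    // Phase 1: sum each term's docFreq across all sub-indexes (and likewise
    // maxDoc) so every node scores with the same global IDF -- this is what
    // makes the merged result identical to a local MultiSearcher search.
    static int[] mergeDocFreqs(int[][] perIndexFreqs) {
        int[] global = new int[perIndexFreqs[0].length];
        for (int[] freqs : perIndexFreqs)
            for (int t = 0; t < freqs.length; t++)
                global[t] += freqs[t];
        return global;
    }

    // Phase 2 result from one sub-index: a (local doc id, score) pair.
    static class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Merge the top hits of each sub-index into one global top-n list,
    // offsetting local ids into a global doc-id space (sub-index i owns
    // ids starting at starts[i]), the way MultiSearcher numbers documents.
    static ScoreDoc[] mergeHits(ScoreDoc[][] perIndexHits, int[] starts, int n) {
        List<ScoreDoc> all = new ArrayList<>();
        for (int i = 0; i < perIndexHits.length; i++)
            for (ScoreDoc sd : perIndexHits[i])
                all.add(new ScoreDoc(starts[i] + sd.doc, sd.score));
        all.sort((a, b) -> Float.compare(b.score, a.score));
        return all.subList(0, Math.min(n, all.size())).toArray(new ScoreDoc[0]);
    }
}
```

Phase 3 then fetches only the globally top-scoring documents.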

Probably it would be worth writing a MultiThreadSearcher that spawns a
thread for each sub-search, then waits for all to finish before merging the
results.
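A minimal sketch of that idea, with each sub-search reduced to a stand-in `Supplier` rather than the real `Searchable` API:

```java
import java.util.function.Supplier;

public class MultiThreadSearch {

    // Spawn one thread per sub-search and wait for every thread to finish
    // before returning, so the caller can then merge the per-index results.
    // Each Supplier stands in for one remote sub-search.
    @SafeVarargs
    static float[] searchAll(Supplier<Float>... subSearches) {
        float[] results = new float[subSearches.length];
        Thread[] threads = new Thread[subSearches.length];
        for (int i = 0; i < subSearches.length; i++) {
            final int idx = i;
            threads[i] = new Thread(() -> results[idx] = subSearches[idx].get());
            threads[i].start();
        }
        for (Thread t : threads) {        // wait for all sub-searches
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return results;
    }
}
```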

So, if you are able to work on this more, it would be great to figure out
what it would take to make Query serializable, to convert the Searcher
implementations to use the above interface in place of the existing similar
abstract methods, and finally to implement an RMI-based RemoteSearcher.

Doug

RE: Notes on distributed searching with Lucene
> From: Scott Ganyo [mailto:scott.ganyo@eTapestry.com]
>
> But this:
>
> Document[] getDocs(int[] i) throws IOException;
>
> still retrieves full documents from the remote index.

In my thinking, this would only be called for documents that are explicitly
requested with Hits.doc(). I was not thinking that distributed search would
support the "low-level" interface, but just the Hits interface. For each
search, two calls would be made per remote index, one to get query term
statistics, and one to get the top-scoring document numbers and scores.
These can be merged, and then only the globally top-scoring document objects
need be retrieved, as they are displayed.
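A rough sketch of that lazy flow; `LazyHits` and its fetch function are hypothetical stand-ins for the real Hits class and the remote getDocs call:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class LazyHits {
    private final int[] ids;                  // globally merged, score-sorted
    private final IntFunction<String> fetch;  // stand-in for remote getDocs
    private final Map<Integer, String> cache = new HashMap<>();

    public LazyHits(int[] ids, IntFunction<String> fetch) {
        this.ids = ids;
        this.fetch = fetch;
    }

    // Like Hits.doc(n): the remote fetch happens here, once per document,
    // and only for the hits the caller actually asks to display.
    public String doc(int n) {
        return cache.computeIfAbsent(ids[n], fetch::apply);
    }
}
```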

Doug

RE: Notes on distributed searching with Lucene
But this:

Document[] getDocs(int[] i) throws IOException;

still retrieves full documents from the remote index. One thing that I had
started to look at with remote indexes is an interface that looked like
this:

public IndexHitCollector search(Query query, Filter filter,
    IndexHitCollector collector) throws IOException;

where IndexHitCollector looks like this:

public abstract class IndexHitCollector
    extends HitCollector
    implements Serializable
{
    protected transient Searcher m_searcher;

    public void setSearcher(Searcher searcher)
    {
        m_searcher = searcher;
    }

    abstract public void collect(int doc, float score);
}

This allows one to return less than full Documents (i.e. just fields or
whatever) from a remote query, as well as perhaps do other ranking,
filtering, and gathering within the collector. This interface was for a
single remote index, of course, but the basic idea is that in a remote
scenario it is best to move control as close to the data as possible, to
avoid as many remote calls and as much transmission of excess data as
possible, don't you agree?
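As a toy illustration of that idea (`TitleCollector` and its field accessor are hypothetical, with server-side field access reduced to a function):

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class TitleCollector implements Serializable {
    private final List<String> titles = new ArrayList<>();

    // Runs next to the index: reads just the "title" field for each hit
    // instead of shipping whole Documents back over the wire. The
    // IntFunction stands in for local field access on the server side.
    public void collect(int doc, float score, IntFunction<String> titleField) {
        titles.add(titleField.apply(doc));
    }

    public List<String> titles() {
        return titles;
    }
}
```

Only the collector (with its small accumulated results) would need to cross the network, not the documents.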

Scott

P.S. Good job, Mark!

> -----Original Message-----
> From: Doug Cutting [mailto:DCutting@grandcentral.com]
> Sent: Monday, March 25, 2002 4:25 PM
> To: 'Lucene Developers List'
> Subject: RE: Notes on distributed searching with Lucene
>
>
> > From: Mark Harwood [mailto:markharwood@totalise.co.uk]
> >
> > I have written up some of my experiences with creating a
> > distributed system
> > with Lucene here:
> >
> > http://home.clara.net/markharwood/lucene/
> >
> > It includes some UML interaction diagrams that I found useful
> > in understanding
> > the Lucene codebase.
>
> Mark,
>
> It's great to see someone experimenting with this. I originally had
> distributed searching in mind when I wrote Lucene, but never
> quite got to
> adding it. A message that mentions some of these intentions is at:
>
> http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00252.html


Re: Notes on distributed searching with Lucene
Hello Mark,

> I have written up some of my experiences with creating a distributed
system
> with Lucene here:
> http://home.clara.net/markharwood/lucene/
> It includes some UML interaction diagrams that I found useful in
understanding
> the Lucene codebase.

Very interesting work! Some quick comments:

* First, a very minor detail: your diagrams are quite big; I'd suggest
removing the stub/skel instances for educational purposes (moreover, RMI is
supposed to make them transparent to the developer), and the anonymous
classes as well.

* I am not sure RMI is the best way of making a distributed search engine
scalable. What about a messaging service such as JMS, in order to cope with
the scalability and bottlenecking problems that come along with the index
readers/writers?

* I think I roughly understand the problem space of a distributed search
engine, but it is not clear to me what exactly is distributed in it. I
mean, how is the data partitioned in the distributed system? Should it be
randomly distributed (which implies sending the query to all the nodes that
host an index), or is there a possibility to distribute it according to some
rule, for example per field or per class of document (in which case the
query is sent to a subset of the indexing nodes only)?
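To make the two options concrete, here is a rough sketch of the routing decision (all names hypothetical):

```java
public class PartitionRouter {

    // Random (hash) partitioning: any document can live on any node, so a
    // query must fan out to every node.
    static int[] nodesForRandom(int numNodes) {
        int[] all = new int[numNodes];
        for (int i = 0; i < numNodes; i++) all[i] = i;
        return all;
    }

    // Rule-based partitioning (e.g. per field value or document class):
    // a deterministic rule maps the partition key to the one node owning
    // that partition, so the query touches only a subset of the nodes.
    static int nodeForKey(String partitionKey, int numNodes) {
        return Math.floorMod(partitionKey.hashCode(), numNodes);
    }
}
```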

I wish there were such a sequence diagram for the index package as well :-)

Rodrigo



RE: Notes on distributed searching with Lucene
> From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
>
> I think Scott's point was that retrieving documents is still too much
> work and perhaps only a few fields need be retrieved. For example, if
> one wanted to present a search results page with titles and summaries
> that's all one would need, whereas documents might also
> contain the full
> text of the document or other stored fields for other types
> of processing.

But if that's the case, then retrieving full documents is probably too slow
locally as well, and probably more fields are being stored in Lucene than is
advisable.

> Another point is that some hit collectors choose to retrieve
> documents
> during scoring, however expensive that may be, in order to do some
> custom scoring or sorting or whatever. In this case, it would
> also help
> if such collectors could be "shipped" over to where the index resides
> and do their job there, so that at least they don't have to move the
> documents across the wire.

Good point. My goal was to efficiently distribute Hits-based searching.
Optimizing other sorts of searching might require a different API.

Doug

Re: Notes on distributed searching with Lucene
Doug Cutting wrote:

>>From: Scott Ganyo [mailto:scott.ganyo@eTapestry.com]
>>
>>But this:
>>
>>Document[] getDocs(int[] i) throws IOException;
>>
>>still retrieves full documents from the remote index.
>>
>
>In my thinking, this would only be called for documents that are explicitly
>requested with Hits.doc(). I was not thinking that distributed search would
>support the "low-level" interface, but just the Hits interface. For each
>search, two calls would be made per remote index, one to get query term
>statistics, and one to get the top-scoring document numbers and scores.
>These can be merged, and then only the globally top-scoring document objects
>need be retrieved, as they are displayed.
>
I think Scott's point was that retrieving documents is still too much
work and perhaps only a few fields need be retrieved. For example, if
one wanted to present a search results page with titles and summaries
that's all one would need, whereas documents might also contain the full
text of the document or other stored fields for other types of processing.

Another point is that some hit collectors choose to retrieve documents
during scoring, however expensive that may be, in order to do some
custom scoring or sorting or whatever. In this case, it would also help
if such collectors could be "shipped" over to where the index resides
and do their job there, so that at least they don't have to move the
documents across the wire.

Dmitry



Re: Notes on distributed searching with Lucene
Good to see that this has prompted some discussion :)

Here is some feedback on the questions this has raised:

1) My application is able to partition the indexes according to some
application-specific data. Each application would have to have its own
scheme for partitioning.
2) The stubs/skeletons described in my sequence diagram are not the actual
RMI stubs/skeletons used (bad naming on my part) but are essential adapters
required to make up for some limitations in the current Lucene classes.

Since implementing the initial design, I have become more interested in
genericising it, making use of mobile Java code and the visitor pattern to
create a very powerful and flexible infrastructure. The remote interface for
each of the index servers becomes very simple:

public interface RemoteIndex extends Remote
{
    public IndexVisitor visit(IndexVisitor visitor) throws RemoteException;
}

The IndexVisitor interface is for serializable Java classes that are
"invited" into each index server and given local access to the index
through a "visit" method:

public interface IndexVisitor
{
    public void visit(LocalIndexInterface localIndex);
}
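For illustration, here is a toy visitor under these assumptions (`LocalIndex` is a simplified local stand-in for LocalIndexInterface):

```java
import java.io.Serializable;

public class DocCountVisitor implements Serializable {

    // Simplified stand-in for LocalIndexInterface.
    interface LocalIndex {
        int numDocs();
    }

    private int count = -1;

    // Runs inside the index server with local access to the index; only
    // the small accumulated result travels back with the returned visitor.
    public void visit(LocalIndex localIndex) {
        count = localIndex.numDocs();
    }

    public int count() {
        return count;
    }
}
```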

Once inside the server, different classes of visitor could conceivably
perform any of the following activities:
* Perform a query
* Gather admin stats (memory usage, index size)
* Alter server characteristics (switch to using RAMDirectory?)
* Deposit and collect named "listeners":
Each index could use the "observer" pattern to allow visitors to plug their
choice of listener classes into the core engine. These local listeners could
perform tasks such as monitoring the search terms used in queries being
performed ("top 10 searches"?) or monitoring average response times. If the
listeners are named, they could be deposited by one visitor and subsequently
collected by a different visitor some time later. RMI would also allow these
listeners to report back to a remote object, e.g. for centralised logging.

// example pseudo-code interface....
public interface LocalIndexInterface
{
    public void close();
    public void optimize();
    public void search(....);

    public void registerSearchListener(SearchListener sl, String name);
    public SearchListener retrieveSearchListener(String name);
    public void unregisterSearchListener(SearchListener sl, String name);

    public void registerNewDocListener(NewDocListener nl, String name);
    // ....plus other listener hooks for index activity
}
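A minimal sketch of one such listener, with hypothetical names, tallying query terms for a "top 10 searches" report:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class TermCountListener implements Serializable {
    private final Map<String, Integer> counts = new HashMap<>();

    // Called by the index server for each query term it sees; a later
    // visitor can collect this listener and read off the tallies.
    public void onSearchTerm(String term) {
        counts.merge(term, 1, Integer::sum);
    }

    public int count(String term) {
        return counts.getOrDefault(term, 0);
    }
}
```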

This of course gives the power to deploy (and undeploy) new Java code to
use, change, or monitor the behaviour of a whole server farm of indexes
without stoppage, recompilation, or redeployment.
Cool.
This level of abstraction offers great flexibility, but would need to be
balanced (same old story) against performance requirements that typically
favour lower-level constructs without abstractions.

