Hi Doug,
We have discussed the need for such a tool internally several times and
have developed some workarounds for Nutch, so I would definitely be
interested in contributing to such a project.
A separate project that depends on Hadoop would be the best
fit for our use cases.
Best,
Stefan
On 18.10.2006 at 23:35, Doug Cutting wrote:
> FYI, I just pitched a new project you might be interested in on
> general@lucene.com. Dunno if you subscribe to that list, so I'm
> spamming you. If it sounds interesting, please reply there. My
> management at Y! is interested in this, so I'm 'in'.
>
> Doug
>
> -------- Original Message --------
> Subject: [PROPOSAL] index server project
> Date: Wed, 18 Oct 2006 14:17:30 -0700
> From: Doug Cutting <cutting@apache.org>
> Reply-To: general@lucene.apache.org
> To: general@lucene.apache.org
>
> It seems that Nutch and Solr would benefit from a shared index serving
> infrastructure. Other Lucene-based projects might also benefit from
> this. So perhaps we should start a new project to build such a thing.
> This could start either in java/contrib, or as a separate sub-project,
> depending on interest.
>
> Here are some quick ideas about how this might work.
>
> An RPC mechanism would be used to communicate between nodes (probably
> Hadoop's). The system would be configured with a single master node
> that keeps track of where indexes are located, and a number of slave
> nodes that would maintain, search and replicate indexes. Clients would
> talk to the master to find out which indexes to search or update, then
> they'll talk directly to slaves to perform searches and updates.
>
> Following is an outline of how this might look.
>
> We assume that, within an index, a file with a given name is written
> only once. Index versions are sets of files, and a new version of an
> index is likely to share most files with the prior version. Versions
> are numbered. An index server should keep old versions of each index
> for a while, not immediately removing old files.
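To make the write-once assumption concrete, here is a rough sketch of what it could buy during replication: a slave would only need to fetch the files it does not already hold. The names ReplicationPlanner and filesToFetch are made up for illustration and are not part of the proposal.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the proposed protocols.
final class ReplicationPlanner {
    // Returns the files of the new version that are missing locally, sorted by name.
    // Relies on the write-once assumption: a file name that is already present
    // locally has the same content, so it never needs to be copied again.
    static Set<String> filesToFetch(String[] newVersionFiles, Set<String> localFiles) {
        Set<String> missing = new TreeSet<>(Arrays.asList(newVersionFiles));
        missing.removeAll(localFiles);
        return missing;
    }
}
```

The first argument would presumably come from something like getFileSet() in the slave-to-slave protocol below; that wiring is an assumption on my side.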
>
> public class IndexVersion {
>   String id;    // unique name of the index
>   int version;  // the version of the index
> }
>
> public class IndexLocation {
>   IndexVersion indexVersion;
>   InetSocketAddress location;
> }
>
> public interface ClientToMasterProtocol {
>   IndexLocation[] getSearchableIndexes();
>   IndexLocation getUpdateableIndex(String id);
> }
>
> public interface ClientToSlaveProtocol {
>   // normal update
>   void addDocument(String index, Document doc);
>   int[] removeDocuments(String index, Term term);
>   void commitVersion(String index);
>
>   // batch update
>   void addIndex(String index, IndexLocation indexToAdd);
>
>   // search
>   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> }
>
> public interface SlaveToMasterProtocol {
>   // sends currently searchable indexes
>   // receives updated indexes that we should replicate/update
>   public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> }
>
> public interface SlaveToSlaveProtocol {
>   String[] getFileSet(IndexVersion indexVersion);
>   byte[] getFileContent(IndexVersion indexVersion, String file);
>   // based on experience in Hadoop, we probably wouldn't really use
>   // RPC to send file content, but rather HTTP.
> }
>
> The master thus maintains the set of indexes that are available for
> search, keeps track of which slave should handle changes to an index and
> initiates index synchronization between slaves. The master can be
> configured to replicate indexes a specified number of times.
>
> The client library can cache the current set of searchable indexes and
> periodically refresh it. Searches are broadcast to one index with each
> id and return merged results. The client will load-balance both
> searches and updates.
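The merge step in the client library could look roughly like the sketch below. It assumes results are ranked by score and that scores from different indexes are comparable, which the proposal does not state. Hit is a made-up stand-in for whatever SearchResults would actually contain, and ResultMerger and mergeTopN are likewise invented names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical client-side merge, not part of the proposed protocols.
final class ResultMerger {
    // Stand-in for a single search hit.
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Combines the per-index hit lists and keeps the n highest-scoring hits.
    static List<Hit> mergeTopN(List<List<Hit>> perIndexHits, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        // Highest score first.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int limit = Math.max(0, Math.min(n, all.size()));
        return new ArrayList<>(all.subList(0, limit));
    }
}
```

Sorting everything is the simplest option; since each slave already returns at most n hits per index, the combined list stays small.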
>
> Deletions could be broadcast to all slaves. That would probably be fast
> enough. Alternately, indexes could be partitioned by a hash of each
> document's unique id, permitting deletions to be routed to the
> appropriate slave.
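For the partitioned variant, routing could be as simple as the sketch below. DeletionRouter and partitionFor are made-up names, and using String.hashCode() as the hash is my assumption; the proposal does not pick a hash function.

```java
// Hypothetical router, not part of the proposed protocols.
final class DeletionRouter {
    // Maps a document's unique id to a partition in [0, numPartitions).
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int partitionFor(String docId, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return Math.floorMod(docId.hashCode(), numPartitions);
    }
}
```

One caveat with plain modulo: changing the number of partitions remaps most ids, so the partition count would have to stay fixed or documents would need to be redistributed.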
>
> Does this make sense? Does it sound like it would be useful to Solr?
> To Nutch? To others? Who would be interested and able to work on it?
>
> Doug
>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com