Mailing List Archive: Re: [Fwd: [PROPOSAL] index server project]

Re: [Fwd: [PROPOSAL] index server project]

Oct 19, 2006, 6:55 AM

Post #1 of 4 (1791 views)

Hi Doug,

We have discussed the need for such a tool internally several times and
developed some workarounds for Nutch, so I would definitely be
interested in contributing to such a project.
A separate project that depends on Hadoop would be the best
fit for our use cases.

Best,
Stefan

Am 18.10.2006 um 23:35 schrieb Doug Cutting:

> FYI, I just pitched a new project you might be interested in on
> general@lucene.apache.org. Dunno if you subscribe to that list, so I'm
> spamming you. If it sounds interesting, please reply there. My
> management at Y! is interested in this, so I'm 'in'.
>
> Doug
>
> -------- Original Message --------
> Subject: [PROPOSAL] index server project
> Date: Wed, 18 Oct 2006 14:17:30 -0700
> From: Doug Cutting <cutting@apache.org>
> Reply-To: general@lucene.apache.org
> To: general@lucene.apache.org
>
> It seems that Nutch and Solr would benefit from a shared index serving
> infrastructure. Other Lucene-based projects might also benefit from
> this. So perhaps we should start a new project to build such a thing.
> This could start either in java/contrib, or as a separate sub-project,
> depending on interest.
>
> Here are some quick ideas about how this might work.
>
> An RPC mechanism would be used to communicate between nodes (probably
> Hadoop's). The system would be configured with a single master node
> that keeps track of where indexes are located, and a number of slave
> nodes that would maintain, search and replicate indexes. Clients would
> talk to the master to find out which indexes to search or update, then
> they'll talk directly to slaves to perform searches and updates.
>
> Following is an outline of how this might look.
>
> We assume that, within an index, a file with a given name is written
> only once. Index versions are sets of files, and a new version of an
> index is likely to share most files with the prior version. Versions
> are numbered. An index server should keep old versions of each index
> for a while, not immediately removing old files.
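
The write-once assumption suggests that a slave replicating a new version only has to copy the files it does not already hold. A minimal sketch of that set difference follows. The names VersionDiff and filesToFetch and the file names in main are made up for illustration and are not part of the proposal.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the proposal. Because a file with a given
// name is written only once, a name already present locally never needs to be
// copied again.
final class VersionDiff {
    /** Returns the names in newVersionFiles that are absent from localFiles, sorted. */
    static Set<String> filesToFetch(Set<String> localFiles, Set<String> newVersionFiles) {
        Set<String> missing = new TreeSet<>(newVersionFiles);
        missing.removeAll(localFiles);
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative file names only.
        Set<String> version1 = Set.of("_1.cfs", "segments_1");
        Set<String> version2 = Set.of("_1.cfs", "_2.cfs", "segments_2");
        System.out.println(filesToFetch(version1, version2)); // [_2.cfs, segments_2]
    }
}
```

Old files that no version still references could be removed later by the same kind of comparison in the other direction.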
>
> public class IndexVersion {
>   String id;    // unique name of the index
>   int version;  // the version of the index
> }
>
> public class IndexLocation {
>   IndexVersion indexVersion;
>   InetSocketAddress location;
> }
>
> public interface ClientToMasterProtocol {
>   IndexLocation[] getSearchableIndexes();
>   IndexLocation getUpdateableIndex(String id);
> }
>
> public interface ClientToSlaveProtocol {
>   // normal update
>   void addDocument(String index, Document doc);
>   int[] removeDocuments(String index, Term term);
>   void commitVersion(String index);
>
>   // batch update
>   void addIndex(String index, IndexLocation indexToAdd);
>
>   // search
>   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> }
>
> public interface SlaveToMasterProtocol {
>   // sends currently searchable indexes
>   // receives updated indexes that we should replicate/update
>   IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> }
>
> public interface SlaveToSlaveProtocol {
>   String[] getFileSet(IndexVersion indexVersion);
>   byte[] getFileContent(IndexVersion indexVersion, String file);
>   // based on experience in Hadoop, we probably wouldn't really use
>   // RPC to send file content, but rather HTTP.
> }
>
> The master thus maintains the set of indexes that are available for
> search, keeps track of which slave should handle changes to an index and
> initiates index synchronization between slaves. The master can be
> configured to replicate indexes a specified number of times.
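
One way the master could decide what to replicate is to count, from the heartbeats, how many slaves report each index and compare that with the configured replica count. This is only a sketch under that assumption. ReplicationPlanner, missingReplicas and the slave names are hypothetical, and the sketch assumes each slave reports an index id at most once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the proposal.
final class ReplicationPlanner {
    /**
     * @param indexesBySlave for each slave, the index ids it reported as searchable
     * @param targetReplicas the configured number of copies per index
     * @return index id mapped to the number of additional copies still needed
     */
    static Map<String, Integer> missingReplicas(Map<String, List<String>> indexesBySlave,
                                                int targetReplicas) {
        // Count how many slaves currently hold each index.
        Map<String, Integer> copies = new HashMap<>();
        for (List<String> ids : indexesBySlave.values()) {
            for (String id : ids) {
                copies.merge(id, 1, Integer::sum);
            }
        }
        // Keep only the indexes that fall short of the target.
        Map<String, Integer> missing = new TreeMap<>();
        for (Map.Entry<String, Integer> e : copies.entrySet()) {
            if (e.getValue() < targetReplicas) {
                missing.put(e.getKey(), targetReplicas - e.getValue());
            }
        }
        return missing;
    }
}
```

The master could then answer the next heartbeat of a slave that lacks the index with the location to copy it from.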
>
> The client library can cache the current set of searchable indexes and
> periodically refresh it. Searches are broadcast to one index with each
> id and return merged results. The client will load-balance both
> searches and updates.
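
The merge step on the client might look roughly like the sketch below. Hit and ResultMerger are hypothetical stand-ins, since the proposal does not define SearchResults, and the sketch assumes hits are ranked by score only rather than by an arbitrary Sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one search hit; not part of the proposal.
final class Hit {
    final String docId;
    final float score;

    Hit(String docId, float score) {
        this.docId = docId;
        this.score = score;
    }
}

// Hypothetical helper, not part of the proposal.
final class ResultMerger {
    /** Merges the hit lists returned for each index id and keeps the n highest-scoring hits. */
    static List<Hit> mergeTopN(List<List<Hit>> perIndexHits, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        // Descending score; List.sort is stable, so ties keep their input order.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int limit = Math.max(0, Math.min(n, all.size()));
        return new ArrayList<>(all.subList(0, limit));
    }
}
```

Since each slave would already return at most n hits, the merged list never holds more than n times the number of index ids.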
>
> Deletions could be broadcast to all slaves. That would probably be fast
> enough. Alternatively, indexes could be partitioned by a hash of each
> document's unique id, permitting deletions to be routed to the
> appropriate slave.
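
The hash-partitioning alternative could be as small as the sketch below. DocumentPartitioner and partitionFor are hypothetical names, and using String.hashCode with a fixed partition count is an assumption for illustration, not something the proposal specifies.

```java
// Hypothetical helper, not part of the proposal.
final class DocumentPartitioner {
    /** Maps a document's unique id to a partition number in [0, numPartitions). */
    static int partitionFor(String docId, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("a", 4)); // 1, because "a".hashCode() is 97
    }
}
```

A deletion for a given document id would then go only to the slaves holding that partition. Changing the number of partitions remaps most ids, so the count would have to stay fixed or the indexes be rebuilt.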
>
> Does this make sense? Does it sound like it would be useful to Solr?
> To Nutch? To others? Who would be interested and able to work on it?
>
> Doug
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com

Re: [Fwd: [PROPOSAL] index server project]

the.mindstorm.mailinglist at gmail

Oct 19, 2006, 7:19 AM

Post #2 of 4 (1684 views)

I am not sure this is (somehow) related, but I think I have noticed
some project on a Sun contest (it was the big prize winner). I cannot
retrieve it now, but hopefully somebody else will.

./alex
--
.w( the_mindstorm )p.

On 10/19/06, Stefan Groschupf <sg@101tec.com> wrote:
> Hi Doug,
>
> We have discussed the need for such a tool internally several times and
> developed some workarounds for Nutch, so I would definitely be
> interested in contributing to such a project.
> A separate project that depends on Hadoop would be the best
> fit for our use cases.
>
> Best,
> Stefan

Re: [Fwd: [PROPOSAL] index server project]

otis_gospodnetic at yahoo

Oct 20, 2006, 12:52 AM

Post #3 of 4 (1659 views)

That's distributed indexing, built on top of Sun Grid. The project won a $50K prize.

----- Original Message ----
From: Alexandru Popescu <the.mindstorm.mailinglist@gmail.com>
To: general@lucene.apache.org
Sent: Thursday, October 19, 2006 10:19:00 AM
Subject: Re: [Fwd: [PROPOSAL] index server project]

I am not sure this is (somehow) related, but I think I have noticed
some project on a Sun contest (it was the big prize winner). I cannot
retrieve it now, but hopefully somebody else will.

./alex
--
.w( the_mindstorm )p.

Re: [Fwd: [PROPOSAL] index server project]

otis_gospodnetic at yahoo

Oct 20, 2006, 12:53 AM

Post #4 of 4 (1671 views)

Damn Y! mail shortcut.
The link to the project is in my Lucene group: http://www.simpy.com/group/363

Otis
