Mailing List Archive

Infrastructure for large Lucene index
I am dealing with a pretty challenging task, so I thought it would be
a good idea to ask the community before I re-invent any wheels of my own.

I have a Lucene index that is going to grow to 100GB soon. This index
is going to be read very aggressively (tens of millions of requests
per day) with some occasional updates (10 batches per day).

The idea is to split the load between multiple server nodes running Lucene
on *nix while accessing the same index that is shared across the network.

I am wondering if it's a good idea and/or if there are any recommendations
regarding selecting/tweaking network configuration (software+hardware)
for an index of this size.

Thank you.

Slava Imeshev
RE: Infrastructure for large Lucene index
Hi Slava,

We currently do this across many machines for
http://www.FreePatentsOnline.com. Our indexes are, in aggregate across our
various collections, even larger than you need. We use a remote
ParallelMultiSearcher, with some custom modifications (and we are in the
process of making more) to allow more robust handling of many processes at
once and integration of the responses from various sub-indexes. This works
fine on commodity hardware, and you will be I/O bound, so get multiple drives
in each machine.
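
For reference, the stock arrangement (before any of our custom
modifications; the host names and the RMI binding name below are just
placeholders) looks roughly like this:

    import java.rmi.Naming;
    import org.apache.lucene.search.*;

    // Each index box first exports its local searcher over RMI, e.g.:
    //   Naming.rebind("//indexN/Searchable",
    //       new RemoteSearchable(new IndexSearcher("/data/index")));
    String[] hosts = { "index1", "index2", "index3" };
    Searchable[] shards = new Searchable[hosts.length];
    for (int i = 0; i < hosts.length; i++) {
        shards[i] = (Searchable) Naming.lookup("//" + hosts[i] + "/Searchable");
    }
    // Queries fan out to all shards in parallel; hits come back merged.
    Searcher searcher = new ParallelMultiSearcher(shards);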

Out of curiosity, what project are you working on? That's a lot of hits!

Sincerely,
James Ryley, Ph.D.
www.FreePatentsOnline.com


RE: Infrastructure for large Lucene index
I may have misinterpreted your email in my initial response. Are you saying
you want nodes (presumably for more CPUs) that all access the same shared
index (on Network Attached Storage, presumably)?

If so, I think you are going to have read and write performance issues
unless you are using some SERIOUS storage system. If you aren't already
committed to the hardware configuration you seem to be describing, I would
go with commodity hardware and split the indexes across each machine -- data
locality is going to be important.

Sincerely,
James Ryley, Ph.D.
www.FreePatentsOnline.com


Re: Infrastructure for large Lucene index
James wrote:
> We use a remote
> ParallelMultiSearcher, with some custom modifications (and we are in the
> process of making more) to allow more robust handling of many processes at
> once and integration of the responses from various sub-indexes.

Can you share these modifications with others? If so, please attach
them to an issue in Jira.

Thanks,

Doug
RE: Infrastructure for large Lucene index
Hi Doug,

I want to say yes -- I think it would only be appropriate given the great
tools you have given us for free. Let me check with the powers that be
here, and then get the code into a more polished form. We hope to have it
really enterprise-ready over the next couple months.

Sincerely,
James

RE: Infrastructure for large Lucene index
James,

--- James <james@ryley.com> wrote:
> We currently do this across many machines for
> http://www.FreePatentsOnline.com. Our indexes are, in aggregate across our
> various collections, even larger than you need. We use a remote
> ParallelMultiSearcher, with some custom modifications (and we are in the
> process of making more) to allow more robust handling of many processes at

I am not sure ParallelMultiSearcher is going to help here, because we have
a large uniform index, not a set of collections.

Regards,

Slava Imeshev
Re: Infrastructure for large Lucene index
On 10/6/06, James <james@ryley.com> wrote:
> Our indexes are, in aggregate across our
> various collections, even larger than you need. We use a remote
> ParallelMultiSearcher, with some custom modifications (and we are in the
> process of making more)

I'm looking into adding some form of distributed search to Solr.
The main problem I see with directly using ParallelMultiSearcher is a
lack of high availability features.

If the index is broken into multiple "shards" then we need multiple
copies of each shard, and some way of load balancing and failing over
amongst copies of shards.
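
To sketch the replica idea (the class and method names here are just
illustrative, not anything in Lucene or Solr today):

    import java.io.IOException;
    import org.apache.lucene.search.*;

    // One logical shard with several replica Searchers. Queries rotate
    // round-robin across the copies and fail over to the next copy
    // when one is down.
    class ShardReplicas {
        private final Searcher[] copies;
        private int next = 0;

        ShardReplicas(Searcher[] copies) { this.copies = copies; }

        synchronized Hits search(Query q) throws IOException {
            IOException last = null;
            for (int tries = 0; tries < copies.length; tries++) {
                Searcher s = copies[(next++ & 0x7fffffff) % copies.length];
                try {
                    return s.search(q);   // hits from a live copy of this shard
                } catch (IOException e) {
                    last = e;             // dead replica: try another copy
                }
            }
            throw last;                   // every copy of this shard is down
        }
    }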

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
RE: Infrastructure for large Lucene index
We have both. Although we have multiple collections, our largest
collections are still way too big for one machine. You have to come up with
a scheme to randomly split documents across multiple servers (randomly so
that word frequency issues hopefully don't mean one index is getting pounded
while others are inactive for certain searches).
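
(A trivial sketch of the assignment step -- writer setup is elided, and
the uniform-random choice is the whole trick:)

    import java.io.IOException;
    import java.util.Random;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    // Spread incoming documents uniformly at random across the shard
    // writers so no single index accumulates a skewed vocabulary.
    void addDocument(IndexWriter[] shards, Random rnd, Document doc)
            throws IOException {
        shards[rnd.nextInt(shards.length)].addDocument(doc);
    }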

Sincerely,
James

RE: Infrastructure for large Lucene index
> If the index is broken into multiple "shards" then we need multiple
> copies of each shard, and some way of load balancing and failing over
> amongst copies of shards.

Yep. Unfortunately it's not simple, but those are all pieces of what we are
currently in the process of implementing.

Sincerely,
James Ryley, Ph.D.
www.FreePatentsOnline.com <http://www.freepatentsonline.com/>

RE: Infrastructure for large Lucene index
--- James <james@ryley.com> wrote:
> I may have misinterpreted your email in my initial response. Are you saying
> you want nodes (presumably for more CPUs) that all access the same shared
> index (on Network Attached Storage, presumably)?

Yes, that's right.

> If so, I think you are going to have read and write performance issues
> unless you are using some SERIOUS storage system.

Yes, that's what I am trying to figure out -- how serious it should be.

> If you aren't already
> committed to the hardware configuration you seem to be describing,

I am not.

> I would go with commodity hardware and split the indexes across each machine -- data
> locality is going to be important.

This is understood, but that is not going to work for searching for "cat dog"
when "cat" is in one index and "dog" is in another.

Slava

RE: Infrastructure for large Lucene index
You don't separate the index by having "cat" in one and "dog" in another.
You separate it by document, so that both indexes have cat and dog, but the
indexes are smaller, meaning that response time is greatly reduced.
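
(Concretely -- the index paths and field name are placeholders -- every
shard sees every query, and the hits come back merged and ranked
together:)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;

    // Each shard holds a random slice of the documents, so both "cat"
    // and "dog" occur in both shards and both are always searched.
    Searchable[] shards = {
        new IndexSearcher("/data/shard0"),
        new IndexSearcher("/data/shard1")
    };
    Searcher searcher = new MultiSearcher(shards);
    Query q = new QueryParser("text", new StandardAnalyzer()).parse("cat dog");
    Hits hits = searcher.search(q);   // merged hits from both shards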

RE: Infrastructure for large Lucene index
--- James <james@ryley.com> wrote:
> You don't separate the index by having "cat" in one and "dog" in another.
> You separate it by document, so that both indexes have cat and dog, but the
> indexes are smaller, meaning that response time is greatly reduced.

I think I oversimplified the problem with this example.

Slava

RE: Infrastructure for large Lucene index
--- James <james@ryley.com> wrote:
> We have both. Although we have multiple collections, our largest
> collections are still way too big for one machine. You have to come up with
> a scheme to randomly split documents across multiple servers (randomly so
> that word frequency issues hopefully don't mean one index is getting pounded
> while others are inactive for certain searches).

Yes, this is a valid concern.

Slava


RE: Infrastructure for large Lucene index
--- James <james@ryley.com> wrote:
> > If the index is broken into multiple "shards" then we need multiple
> > copies of each shard, and some way of load balancing and failing over
> > amongst copies of shards.
>
> Yep. Unfortunately it's not simple, but those are all pieces of what we are
> currently in the process of implementing.

The problem is that over time indexes develop a "personality", and term
frequencies can vary significantly from index to index....

Slava

RE: Infrastructure for large Lucene index
Agreed. For example, with patents we have to be concerned about
technology-related terms that are more prominent in certain time periods. I
think a good random assignment scheme addresses most such problems, but
worst case you can always redo the indexes entirely if they get too
non-random.

Sincerely,
James

Re: Infrastructure for large Lucene index
On 10/6/06, Slava Imeshev <imeshev@yahoo.com> wrote:
> --- James <james@ryley.com> wrote:
> > > If the index is broken into multiple "shards" then we need multiple
> > > copies of each shard, and some way of load balancing and failing over
> > > amongst copies of shards.
> >
> > Yep. Unfortunately it's not simple, but those are all pieces of what we are
> > currently in the process of implementing.
>
> The problem is that over time indexes develop a "personality", and term
> frequencies can vary significantly from index to index....

A global idf calculation is possible though... MultiSearcher already
does this when searching across multiple indices. The downside of
doing it across remote indices is an increase in the number of RPC
calls. In general, it's probably better to try and keep index shards
balanced.
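
(Hand-rolled here for illustration -- MultiSearcher does the equivalent
of this internally when it weights a query:)

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Searchable;

    // Global document frequency for a term is just the sum over all
    // sub-searchers; for remote shards each call is an extra RPC.
    int globalDocFreq(Searchable[] shards, Term t) throws IOException {
        int df = 0;
        for (int i = 0; i < shards.length; i++) {
            df += shards[i].docFreq(t);
        }
        return df;
    }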


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: Infrastructure for large Lucene index
James wrote:
> Let me check with the powers that be
> here, and then get the code into a more polished form. We hope to have it
> really enterprise-ready over the next couple months.

Great! Once you have permission, please post it sooner rather than
later, then others can help with polishing, or at least be informed by
your methods. What I'd hate to happen is for you to get permission but
never have time to polish it, and hence never contribute it. An
unpolished patch is better than no patch at all.

Thanks!

Doug
Re: Infrastructure for large Lucene index
> I am dealing with a pretty challenging task, so I thought it would be
> a good idea to ask the community before I re-invent any wheels of my own.
>
> I have a Lucene index that is going to grow to 100GB soon. This index
> is going to be read very aggressively (tens of millions of requests
> per day) with some occasional updates (10 batches per day).
>
> The idea is to split the load between multiple server nodes running Lucene
> on *nix while accessing the same index that is shared across the network.
>
> I am wondering if it's a good idea and/or if there are any recommendations
> regarding selecting/tweaking network configuration (software+hardware)
> for an index of this size.

A few quick comments on this, taking into account some of the subsequent
thread discussion.

1. Unless you have a lot of RAM, sufficient to effectively keep the
entire index in memory, you're better off maximizing the number of
spindles. Using one big file server is, IMHO, a Really Bad Idea for
this type of application. You'll pay top dollar for something with
the reliability and performance that you think you need, and then
you'll still wind up being I/O bound.

I'd say the best configuration is a dual CPU/dual core box with 4
fast drives and a boat-load of RAM - say 8GB for starters. You run
four JVMs with four indexes on each box, where each index is on a
separate drive. Assume the file system will do a reasonable job of
caching data for you, so don't bother trying to use RAMDirectory or
MMapDirectory.
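
In code terms there's nothing exotic about the per-drive layout (the
mount point below is hypothetical): each JVM just opens the one index
on its own spindle with a plain FSDirectory-backed searcher and lets
the OS cache do the rest.

    import org.apache.lucene.search.IndexSearcher;

    // One JVM per drive: open only the index on this JVM's spindle.
    // A path argument gives you FSDirectory under the hood.
    IndexSearcher searcher = new IndexSearcher("/drive1/index");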

2. It's easy to get hung up on document frequency skews. As James and
others have noted, in general things seem to work OK by just
randomizing which document goes to what index - e.g. do it by hash of
the document URL/name, and make sure that every new batch of
documents (if you're doing incremental updates) gets spread this same
way. As long as your hash function has nothing to do with searchable
terms that you care about, you should be OK.
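
A minimal sketch of that kind of assignment (the shard count and the
choice of URL as the key are placeholders):

    // Route a document to a shard by hashing its URL/name. The mask
    // keeps the hash non-negative, the same document always lands on
    // the same shard, and searchable terms play no role in placement.
    int shardFor(String docUrl, int numShards) {
        return (docUrl.hashCode() & 0x7fffffff) % numShards;
    }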

3. If you're worried about high availability, then one fairly simple
approach is to have two parallel search clusters, with a load
balancer in front. For each cluster, monitor both the front-end
server (where the results get combined) and each of the back-end
search servers - for example, something like Big Brother or Ganglia.
Then if one of the search servers (or, god forbid, the front end
server) goes down, you can automatically remove that cluster from the
load balancer's active set.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
Re: Infrastructure for large Lucene index
: 3. If you're worried about high availability, then one fairly simple
: approach is to have two parallel search clusters, with a load
: balancer in front. For each cluster, monitor both the front-end
: server (where the results get combined) and each of the back-end
: search servers - for example, something like Big Brother or Ganglia.
: Then if one of the search servers (or, god forbid, the front end
: server) goes down, you can automatically remove that cluster from the
: load balancer's active set.

the availability of this approach doesn't scale very cleanly though ... if
any one box in either cluster goes down, the entire cluster becomes
unusable. Doubling the size of your collection would only double the
number of boxes you need -- but the reliability would be cut in half,
meaning you'd really need to quadruple the number of boxes (doubling the
number of clusters) to maintain the same level of reliability ... if i'm
not mistaken, the number of boxes would need to grow quadratically as your
index size grows linearly.

A system where every individual node in the cluster is load balanced
across 2 physical boxes would require the same amount of hardware to start
with, but would require a lot less hardware to grow.



-Hoss
Re: Infrastructure for large Lucene index
Chris Hostetter wrote:
> : 3. If you're worried about high availability, then one fairly simple
> : approach is to have two parallel search clusters, with a load
> : balancer in front. For each cluster, monitor both the front-end
> : server (where the results get combined) and each of the back-end
> : search servers - for example, something like Big Brother or Ganglia.
> : Then if one of the search servers (or, god forbid, the front end
> : server) goes down, you can automatically remove that cluster from the
> : load balancer's active set.
>
> the availability of this approach doesn't scale very cleanly though ... if
> any one box in either cluster goes down, the entire cluster becomes
> unusable.

A cost-effective variation works as follows: if you have 10 indexes and
11 nodes, then you keep one node as a spare. When any of the 10 active
nodes fails, the 11th assumes its duties. While the 11th node is
launching you search only 9 out of the 10 indexes, so failover is not
entirely seamless, but it's a lot cheaper than mirroring all nodes.
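
(Purely illustrative -- the monitor, the shard store, and every name
below are invented:)

    // When node i dies: drop shard i from the front end, have the
    // spare pull that index from wherever it is kept, then re-add it.
    void onNodeFailure(int shard, SpareNode spare, FrontEnd frontEnd) {
        frontEnd.removeShard(shard);     // degrade to 9 of 10 shards
        spare.loadIndex(shard);          // copy/mount the lost index
        frontEnd.addShard(shard, spare); // restore full coverage
    }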

Doug
Re: Infrastructure for large Lucene index
Doug,

--- Doug Cutting <cutting@apache.org> wrote:

> > the availability of this approach doesn't scale very cleanly though ... if
> > any one box in either cluster goes down, the entire cluster becomes
> > unusable.
>
> A cost-effective variation works as follows: if you have 10 indexes and
> 11 nodes, then you keep one node as a spare. When any of the 10 active
> nodes fail, the 11th resumes its duties. While the 11th node is
> launching you search only 9 out of the 10 indexes, so failover is not
> entirely seamless, but it's a lot cheaper than mirroring all nodes.

How does the 11th know what index it has to bring up? In other words,
where would it get the lost index?

Slava
RE: Infrastructure for large Lucene index
Save them all and redirect the index reader appropriately when one of the
main indexes fails.

Re: Infrastructure for large Lucene index
It sounds like the 11th node would have to have a large disk with all indices. Or perhaps you'd keep copies of all your indices elsewhere, and would pull the right one in when you see which node you need to replace.

Otis

Re: Infrastructure for large Lucene index
would it be possible to design your solution so that you could have
multiple replicas at the shard level?

obviously you have a memory issue on how many shards a single machine
can serve, but if you make your shards small enough you might be able
to get a single machine capable of serving 3-4 shards at any one
time. so with 10 machines you would have 40 shards being able to be
served. you would
then distribute the shards so the popular ones would we on 2-3x as
many machines as the less popular ones.

loss of a machine could be handled in a non-disruptive manner as
well, as long as you ensure that ALL shards are served by at least 2
machines in your cluster.

would this work or be possible in your situation?
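
(rough sketch of the placement step -- the names are invented, and the
point is just that every shard lands on at least two hosts:)

    import java.util.*;

    // assign each shard to `copies` distinct machines, round-robin,
    // so losing any single machine never makes a shard unreachable
    // (requires copies <= hosts.length).
    void place(String[] shards, String[] hosts, int copies, Map placement) {
        int h = 0;
        for (int s = 0; s < shards.length; s++) {
            List owners = new ArrayList();
            for (int c = 0; c < copies; c++) {
                owners.add(hosts[h++ % hosts.length]);
            }
            placement.put(shards[s], owners);
        }
    }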

regards
Ian



--
Ian Holsman
Ian@Zilbo.com
http://personalinjuryfocus.com/
Re: Infrastructure for large Lucene index
Is it possible to build a distributed Lucene indexing/searching system
based on JMS messaging?

I.e.

1. Indexing
- To index documents, you send a message to some queue.
- One of the nodes receives your message and stores the document in its
index (selective consumer?)

2. Searching
- To search the whole collection of documents, you send a message to a
"search" topic (with a specified time-to-live).
- Consume replies with a specified timeout.
- Sort out the results from the several nodes.
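
Something like this for the search side (a rough sketch; the JMS 1.1
calls are standard, but the destination names, message format, and the
merging step are my assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import javax.jms.*;

    // Publish the query to the "search" topic and gather per-node
    // replies on a temporary queue until the timeout expires.
    List search(Session session, Topic searchTopic, String query,
                long timeoutMs) throws JMSException {
        TemporaryQueue replyTo = session.createTemporaryQueue();
        TextMessage msg = session.createTextMessage(query);
        msg.setJMSReplyTo(replyTo);
        session.createProducer(searchTopic).send(msg);

        MessageConsumer consumer = session.createConsumer(replyTo);
        List results = new ArrayList();
        Message reply;
        while ((reply = consumer.receive(timeoutMs)) != null) {
            results.add(reply);   // partial results; sort/merge afterwards
        }
        return results;
    }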

Regards,
Alex Serba