Mailing List Archive

Kind of hardware config ?
We want to index about 2 million HTML documents with Lucene.
Do you have an idea of the most suitable machine configuration (dual
processor, 2 GB of memory, RAID disks...)?
--
View this message in context: http://www.nabble.com/Kind-of-hardware-config---tf2176085.html#a6016661
Sent from the Lucene - General forum at Nabble.com.
RE: Kind of hardware config ?
What's the total document size?

Sincerely,
James Ryley, Ph.D.

> -----Original Message-----
> From: caribou_surf [mailto:eric@mixad.com]
> Sent: Monday, August 28, 2006 5:01 AM
> To: general@lucene.apache.org
> Subject: Kind of hardware config ?
RE: Kind of hardware config ?
About 100 GB.




RE: Kind of hardware config ?
OK, so you aren't going to get it into memory unless you spend a lot on
servers. We haven't found memory (or disk access) to be a limiting factor
anyway -- CPU is the issue. I'm not sure what you want to spend, but a
single server with SATA RAID, 4GB RAM and the latest AMD processor will
search your collection in ~10-20 seconds, depending on the complexity of the
search. If you need faster performance or the ability to support many hits
at once, you are going to have to parallelize the configuration across
multiple servers using ParallelMultiSearcher.

Keep in mind that Lucene isn't really set up to handle parallel searching
robustly. There is a lot of code you are going to have to write for an
enterprise-ready solution (e.g., checking the status of a given server to
make sure it isn't down, redundantly storing indexes so that the search
still functions if one server is down, potentially handling laggards to
increase speed, etc.).

We have done some of this, and have more to do -- it is a very non-trivial
task.
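
To give a rough idea of the plumbing listed above, here is a minimal sketch in plain Java. It is not Lucene code and not our production setup: `ShardSearcher`, `FakeShard` and `FanOutSearch` are hypothetical names made up for illustration, and a real deployment would put a remote Lucene searcher behind the interface. It shows three of the points: skipping servers that are down, falling back to a replica of the same index partition, and dropping laggards once a time budget is used up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Sends one query to every index partition and merges whatever comes back in time. */
class FanOutSearch {
    /**
     * @param partitions    each inner list holds the replicas of one index partition
     * @param timeoutMillis time budget for the whole fan-out; slower servers are dropped
     */
    static List<String> search(List<List<ShardSearcher>> partitions, String query, long timeoutMillis) {
        List<Callable<List<String>>> tasks = new ArrayList<>();
        for (List<ShardSearcher> replicas : partitions) {
            for (ShardSearcher candidate : replicas) {
                if (candidate.isAlive()) {          // use the first replica that is up
                    tasks.add(() -> candidate.search(query));
                    break;
                }
            }
            // If no replica is alive, the partition is skipped and results are partial.
        }
        List<String> merged = new ArrayList<>();
        if (tasks.isEmpty()) {
            return merged;
        }
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size());
        try {
            // invokeAll cancels every task that has not finished when the timeout expires.
            List<Future<List<String>>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (Future<List<String>> future : futures) {
                if (future.isCancelled()) {
                    continue;                       // laggard: ignore it
                }
                try {
                    merged.addAll(future.get());
                } catch (ExecutionException e) {
                    // The server failed mid-query: skip its results.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();     // preserve the interrupt for the caller
        } finally {
            pool.shutdownNow();
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<ShardSearcher>> partitions = new ArrayList<>();
        partitions.add(Arrays.<ShardSearcher>asList(
                new FakeShard(false, 0, "x1"), new FakeShard(true, 0, "a1")));
        partitions.add(Arrays.<ShardSearcher>asList(new FakeShard(true, 5000, "late")));
        System.out.println(search(partitions, "lucene", 500));
    }
}

/** Hypothetical view of one index server; not a Lucene interface. */
interface ShardSearcher {
    boolean isAlive();                                   // health check
    List<String> search(String query) throws Exception;  // ids of matching documents
}

/** In-memory stand-in for a server, used only to exercise the sketch. */
class FakeShard implements ShardSearcher {
    private final boolean alive;
    private final long delayMillis;
    private final List<String> hits;

    FakeShard(boolean alive, long delayMillis, String... hits) {
        this.alive = alive;
        this.delayMillis = delayMillis;
        this.hits = Arrays.asList(hits);
    }

    public boolean isAlive() { return alive; }

    public List<String> search(String query) throws InterruptedException {
        Thread.sleep(delayMillis);                       // simulate a slow server
        return hits;
    }
}
```

Even this toy version leaves out result ranking across partitions, retries, and refreshing the health status, which is where most of the real work goes.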

Sincerely,
James Ryley, Ph.D.

> -----Original Message-----
> From: caribou_surf [mailto:eric@mixad.com]
> Sent: Monday, August 28, 2006 10:42 AM
> To: general@lucene.apache.org
> Subject: RE: Kind of hardware config ?
Re: Kind of hardware config ?
Hey guys.

4 GB of RAM for an index of 2 million documents should really not be a
problem. You should consider separating the index from the actual content
(i.e., only save the index data in your index, not the HTML), if you have the
possibility to do that. I am not very comfortable with the very core
functionality in Lucene, but even if you stored the raw data with the index
data, only the index data should be held in memory and the raw data read
from disk with, if there's room, some caching.
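
To illustrate what I mean by keeping the content out of the index, here is a toy sketch in plain Java. It is not how Lucene is implemented, and `PathOnlyIndex` is a made-up class: the index keeps only terms and the path of each document, and the HTML stays on disk to be read when a hit is displayed. In Lucene the equivalent idea is to index the body field without storing it (`Field.Store.NO`) and to store only a small key such as the path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index: keeps terms and one path per document, never the HTML itself. */
class PathOnlyIndex {
    // doc id -> location of the original file on disk
    private final List<String> paths = new ArrayList<>();
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes text already extracted from the HTML; only the path is remembered. */
    int add(String path, String text) {
        int docId = paths.size();
        paths.add(path);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Returns the paths of matching documents; the caller loads the HTML from disk. */
    List<String> search(String term) {
        List<String> result = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        for (int docId : ids) {
            result.add(paths.get(docId));
        }
        return result;
    }

    public static void main(String[] args) {
        PathOnlyIndex index = new PathOnlyIndex();
        index.add("/data/a.html", "Lucene indexes text");
        index.add("/data/b.html", "Hardware for Lucene");
        System.out.println(index.search("lucene"));
    }
}
```

The memory cost of such an index depends on the number of distinct terms and postings, not on the size of the original files.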

With the numbers you mention, James, it sounds like both the raw data and
index data are held in memory? If you have good insight into the internals,
feel free to correct me on this issue... I'm also involved in applications
with very large indices, so this is very interesting.

Thanks,
Fredrik


On 8/28/06, James <james@ryley.com> wrote:
RE: Kind of hardware config ?
Hi,

He said that he has 100GB of data -- the number of documents is somewhat
unimportant. 100GB of data is going to end up being 30-60GB of index,
depending on certain things like whether you want to store both a stemmed
and unstemmed index (we do, to give the user the option of how they want to
search). No way you are going to get that much data into memory on a normal
server -- half the memory will be used by the OS, JVM, etc. I just
specified 4GB to give plenty for the general machine processes to work with,
but assumed that you will almost never be referring to actual data or
indexes in RAM.
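
The back-of-the-envelope arithmetic can be written out as below. This is only a sketch: `IndexSizing` is a made-up helper, and the 30-60% index-to-data ratio and the 50% OS/JVM overhead are the rough figures from this message, not measured constants.

```java
/** Rough sizing helper; the ratios are rules of thumb, not measurements. */
class IndexSizing {
    /** Estimated index size in GB for a corpus, given an index-to-corpus ratio. */
    static double indexGb(double corpusGb, double ratio) {
        return corpusGb * ratio;
    }

    /** RAM in GB left for data once the OS, JVM, etc. have taken their share. */
    static double usableRamGb(double ramGb, double overheadFraction) {
        return ramGb * (1.0 - overheadFraction);
    }

    /** True if the whole index could be held in the usable RAM. */
    static boolean fitsInRam(double indexGb, double usableRamGb) {
        return indexGb <= usableRamGb;
    }

    public static void main(String[] args) {
        double low = indexGb(100, 0.30);     // smaller index, e.g. stemmed only
        double high = indexGb(100, 0.60);    // stemmed and unstemmed indexes
        double usable = usableRamGb(4, 0.5); // half of 4 GB goes to OS, JVM, etc.
        System.out.printf("index: %.0f-%.0f GB, usable RAM: %.0f GB, fits: %b%n",
                low, high, usable, fitsInRam(low, usable));
    }
}
```

With these inputs even the low estimate is more than ten times the usable RAM, so the index will be read from disk.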

Sincerely,
James Ryley, Ph.D.

> -----Original Message-----
> From: Fredrik Andersson [mailto:fidde.andersson@gmail.com]
> Sent: Tuesday, August 29, 2006 4:29 AM
> To: general@lucene.apache.org
> Subject: Re: Kind of hardware config ?