Mailing List Archive

sanity check - index size
I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
These are text files and HTML files. I only index them into a few fields
(title, content, filename). My index (specifically _sd.fdt) is 20MB. The
bulk of the HTML files are Javadoc files (Ant's own documentation,
actually).

Does that seem at all close to being reasonable/normal? I am calling
optimize() before closing the index.

Thanks for the sanity check.

Erik



--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: sanity check - index size
On Mon, 20 May 2002, Erik Hatcher wrote:

> I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
> My index (specifically _sd.fdt) is 20MB.
>
> Does that seem at all close to being reasonable/normal? I am calling
> optimize() before closing the index.

hi,

I've wondered the same thing. The indexes I build with Lucene are
generally around the same size as the corpus. That was larger than I
thought it would be, but it doesn't really matter since disk is pretty
cheap (and my corpus isn't very big).

-- James


Re: sanity check - index size
This seems big depending on what you are storing.

For example, I have a set of data with 457MB and my Lucene index is 115MB.
However, I don't store much.

If you are storing the complete text (even if you don't index it), then the
index will be about the same size as (no, probably bigger than) your original
data set.
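
To make the stored-versus-indexed distinction concrete, here is a minimal
sketch of a toy index. It is not Lucene's API or file format: the class
ToyIndex, its methods, and the 4-bytes-per-posting estimate are all made up
for illustration. It only shows that keeping a verbatim copy of the text adds
roughly the corpus size on top of the searchable postings, and that search
results are the same either way. The .fdt file mentioned in the question is
where Lucene keeps stored field values, so its size tracks how much text is
stored. In Lucene releases of that era the choice was, as far as I recall,
made by the field factory method (for example Field.Text versus
Field.UnStored); check the Javadoc for your version.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of an index. Hypothetical; NOT Lucene's API or on-disk format. */
public class ToyIndex {
    // term -> ids of the documents containing it (the searchable part)
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    // verbatim copies of document text (the retrievable, "stored" part)
    private final List<String> stored = new ArrayList<>();
    private int docCount = 0;

    /** Indexes the text; keeps a verbatim copy only when store is true. */
    public void add(String text, boolean store) {
        int docId = docCount++;
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // record each document at most once per term
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
        if (store) {
            stored.add(text);
        }
    }

    /** Rough size of the searchable part: each term once, 4 bytes per posting. */
    public long postingsBytes() {
        long n = 0;
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            n += e.getKey().getBytes(StandardCharsets.UTF_8).length;
            n += 4L * e.getValue().size();
        }
        return n;
    }

    /** Size of the verbatim copies; zero when nothing was stored. */
    public long storedBytes() {
        long n = 0;
        for (String s : stored) {
            n += s.getBytes(StandardCharsets.UTF_8).length;
        }
        return n;
    }

    /** Ids of documents containing the term; works with or without storing. */
    public List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        String[] docs = {
            "Ant builds Java projects",
            "Lucene indexes text",
            "Ant and Lucene are Apache projects"
        };
        ToyIndex withStore = new ToyIndex();
        ToyIndex withoutStore = new ToyIndex();
        for (String d : docs) {
            withStore.add(d, true);
            withoutStore.add(d, false);
        }
        System.out.println("postings bytes (both): " + withStore.postingsBytes());
        System.out.println("stored bytes, storing content: " + withStore.storedBytes());
        System.out.println("stored bytes, not storing: " + withoutStore.storedBytes());
    }
}
```

With real data the searchable part is usually much smaller than the text
itself, because a term that occurs many times is written only once plus a
short list of document ids. That is why an index without stored content can
come out well below the corpus size.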

--Peter

On 5/20/02 4:16 PM, "Erik Hatcher" <lists@ehatchersolutions.com> wrote:

> I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
> These are text files and HTML files. I only index them into a few fields
> (title, content, filename). My index (specifically _sd.fdt) is 20MB. The
> bulk of the HTML files are Javadoc files (Ant's own documentation,
> actually).
>
> Does that seem at all close to being reasonable/normal? I am calling
> optimize() before closing the index.
>
> Thanks for the sanity check.
>
> Erik


Re: sanity check - index size
Erik,

Mine is amazingly small too. I store a very small field (a key into my
MySQL database for the full document) and index the content without
storing it. I have 40,000+ docs and it is only 28MB in size. I've done
several searches and it seems to be correct.
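
For comparison, the figures quoted in this thread can be put side by side
with a little arithmetic. The sketch below is only that: the class and method
names are made up, 1MB is assumed to be 1024KB, and since the document count
is given as "40,000+" the per-document figure is an upper bound.

```java
/** Arithmetic on the sizes quoted in this thread. Names are hypothetical. */
public class IndexRatios {
    /** Index size as a fraction of the corpus size. */
    public static double ratio(double indexMb, double corpusMb) {
        return indexMb / corpusMb;
    }

    /** Average index size per document in KB, assuming 1MB = 1024KB. */
    public static double kbPerDoc(double indexMb, int docs) {
        return indexMb * 1024.0 / docs;
    }

    public static void main(String[] args) {
        // 15MB corpus, 20MB index (content stored)
        System.out.printf("content stored: %.2f x corpus%n", ratio(20, 15));
        // 457MB corpus, 115MB index (little stored)
        System.out.printf("little stored:  %.2f x corpus%n", ratio(115, 457));
        // 40,000+ documents, 28MB index (only a database key stored)
        System.out.printf("key only:       %.2f KB per doc at most%n", kbPerDoc(28, 40000));
    }
}
```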

James

*********************************************************
The Game Development Search Engine
and DQuest E-zine
http://www.gdse.com/

A Member of the Future Games Network
http://www.fgn.com/


On Mon, 20 May 2002, Erik Hatcher wrote:

> I'm indexing 900+ files (less than 1,000) that total about 15MB in size.
> These are text files and HTML files. I only index them into a few fields
> (title, content, filename). My index (specifically _sd.fdt) is 20MB. The
> bulk of the HTML files are Javadoc files (Ant's own documentation,
> actually).
>
> Does that seem at all close to being reasonable/normal? I am calling
> optimize() before closing the index.
>
> Thanks for the sanity check.
>
> Erik


Re: sanity check - index size
Duh... that was my issue. I am storing the content also. Sorry for the
newbie question. I'll crawl back under my rock now.....


----- Original Message -----
From: "Peter Carlson" <carlson@bookandhammer.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Monday, May 20, 2002 8:50 PM
Subject: Re: sanity check - index size


> This seems big depending on what you are storing.
>
> For example, I have a set of data with 457MB and my Lucene index is 115MB.
> However, I don't store much.
>
> If you are storing the complete text (even if you don't index it), then the
> index will be about the same size as (no, probably bigger than) your
> original data set.
>
> --Peter

