Mailing List Archive

Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
First off, thanks to Jagadesh Nandasamy, who pointed me in the right
direction.

It seems that, in my situation, more homogeneous indexes work better
than fewer heterogeneous ones:

I have a dozen classes that I'm indexing. They vary from two fields to
more than a dozen fields per document (aka object). I went through
different indexing strategies with them (per class, per date, per root
class, ...) to see how it goes. In any case, while trying to use my
stuff with rc4, I consolidated all my different class indexes into one
root class index to see if I could reduce my resource consumption. Fewer
indexes, fewer RandomAccessFiles was the rationale. Well, I was wrong. In
fact, the exact opposite seems to hold true: more -homogeneous- indexes
use overall fewer RandomAccessFiles than fewer -heterogeneous- indexes...

One of those -not so obvious- things you have to learn the hard way, I
guess... ;-)

In any case, I would like to thank Jagadesh again for his insight. Also
thanks to Pier Fumagalli for pointing out "lsof". A very handy tool
indeed.

As a final note, several people suggested increasing the number of file
descriptors per process with something like "ulimit"... From what I
learned today, I think it's a *bad* idea to have to change system
parameters just because your/my app is written in such a way that it may
run out of some system resource. Your/my app has to fit in the system.
Hacking "ulimit" and/or other system parameters is just a quick patch
that will -at best- delay dealing with the real problem, which is
usually one of design.

Just my two cents.

PA.



--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
On Mon, 29 Apr 2002, petite_abeille wrote:

> As a final note, several people suggested to increase the number of
> file descriptors per process with something like "ulimit"... From what
> I learned today, I think it's a *bad* idea to have to change some
> system parameters just because your/my app is written in such a way
> that it may run out of some system resources. Your/my app has to fit
> in the system. Hacking "ulimit" and/or other system parameters is
> just a quick patch that will -at best- delay dealing with the real
> problem that's usually one of design.

If you have a suggestion for how Lucene could use fewer file descriptors
while still maintaining its performance, I'm sure that the developers
would be interested to hear it.

However, some programs do require more resources than others. If--as I
suspect is true in this case--this is a consequence of the complexity of
the task, then there's not much point in complaining about it.

Joshua

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.




Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
petite,

On Mon, Apr 29, 2002 at 07:54:43PM +0200, petite_abeille wrote:
> As a final note, several people suggested to increase the number of file
> descriptors per process with something like "ulimit"...

Just be glad you aren't doing this on Solaris with JDK 1.1.6,
where I first ran into ulimit issues - back when I encountered this
problem, the Solaris default ulimit setting was 24 files, and JDK
1.1.6 reported the problem as an "OutOfMemory" error! Looks like
things are improving :-).

> From what I learned today, I think it's a *bad* idea to have to
> change some system parameters just because your/my app is written in
> such a way that it may run out of some system resources. Your/my app
> has to fit in the system. Hacking "ulimit" and/or other system
> parameters is just a quick patch that will -at best- delay dealing
> with the real problem that's usually one of design.

Yes and no. Setting ulimit to a reasonable number of open files
is not only not a patch, it's the "right" way to do it. I understand
where you're coming from, really, and in a certain way, it makes
sense, BUT... sometimes the impulse for clean, good design takes you
too far down a blind alley. Sometimes there is no elegant solution.
Sometimes there is no "best" way, only one of a limited set of options
with different tradeoffs.

By definition, Lucene is an application that trades off up front
CPU (for indexing) and file resources (for storage) for request-time
speed. The OS's job is to manage resources, and open files are one of
those resources. That's the tradeoff here, and it's reasonable and
expected. Most serious applications have to have some sort of OS
variable tweaking, you're just used to having it done invisibly and
painlessly.

That said, since you're working on a client/desktop application,
not a server application, you need to think about ways to handle this:

You could figure out the "right" way to set the system
configuration on install or launch.

You could look at the alternative techniques for indexing in
Lucene, and see if any approaches there can help - for example, maybe
doing a lot of the more intense indexing work in a RAMDirectory, then
merging it into a normal file-based Directory.
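Steven's RAMDirectory idea would look roughly like the following against
the org.apache.lucene API of that era. This is an untested sketch that
needs the Lucene jar; `batchOfDocuments` is a stand-in for however your
app produces documents:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// Do the intense indexing in memory, then fold the result into the
// on-disk index in a single merge, keeping the window in which the
// file-based writer holds many descriptors as short as possible.
Directory ram = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
for (int i = 0; i < batchOfDocuments.length; i++) {
    ramWriter.addDocument(batchOfDocuments[i]);
}
ramWriter.close();

IndexWriter fsWriter = new IndexWriter(
        FSDirectory.getDirectory("/path/to/index", false),
        new StandardAnalyzer(), false);
fsWriter.addIndexes(new Directory[] { ram }); // one merge pass
fsWriter.close();
```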

You could look more closely at what your application is doing,
and see if there's anything you're doing wrong (perhaps opening files
and not closing them, and leaving them for the garbage collector to
eventually get around to closing?) or if you have a pessimal usage
pattern that exacerbates the situation.

You could take a closer look at the Lucene indexing and file
management stuff, and see if you can come up with a scheme to run
Lucene indexing with modified code for keeping track of file
resources.

I'll bet Doug and the other developers would rather not add
open-file management as a main, permanent part of Lucene, since it
would add overhead to all uses of Lucene just to deal with an
anomalous situation (use on a client/desktop machine). But they might
be interested in a way to offer it as an optional feature, where
people using Lucene in a constrained environment could configure
Lucene to be careful about how many files it keeps open at any given
time.

Steven J. Owens
puff@darksleep.com

Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
On Tuesday, April 30, 2002, at 01:57 AM, Steven J. Owens wrote:

> Just be glad you aren't doing this on Solaris with JDK 1.1.6

I know... In fact I'm looking forward to porting my stuff to 1.4... As
my app is very much I/O bound, I'm really excited by this nio
madness... :-)

> Yes and no. Setting ulimit to a reasonable number of open files is not
> only not a patch, it's the "right" way to do it.

Of course... Nothing is really black or white... What I wanted to say is
that -as a first strike- *I* prefer not to mess around with system
parameters.

> I understand where you're coming from, really, and in a certain way,
> it makes sense

Thanks. I already feel less alone... ;-)

> BUT... sometimes the impulse for clean, good design takes you too far
> down a blind alley.

Sure. At the end of the day, everything is a tradeoff...

> Sometimes there is no elegant solution. Sometimes there is no "best"
> way, only one of a limited set of options with different tradeoffs.

Absolutely.

> Most serious applications have to have some sort of OS variable
> tweaking, you're just used to having it done invisibly and painlessly.

Agreed. In fact, that's my first desktop application in nearly a decade.
I usually work on large-scale systems. And let me tell you, it's a whole
different ball game... ;-)

> You could figure out the "right" way to set the system configuration
> on install or launch.

One of my design "goals" is to avoid this sort of tweaking as much as I
can.

> You could look at the alternative techniques for indexing in Lucene

That's another one of those nasty tradeoffs... ;-) Memory is even more
precious than file descriptors in my situation... Especially with a JVM
that has this funky notion of constraining your memory usage...

> if there's anything you're doing wrong (perhaps opening files and not
> closing them, and leaving them for the garbage collector to eventually
> get around to closing?)

Sure. I went through all those sanity checks. Also, in my case, the
garbage collector is my friend as I'm using the java.lang.ref API
extensively.
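For readers who haven't used java.lang.ref: a minimal sketch of the kind
of soft-reference cache PA is alluding to. The class and method names
here are mine, for illustration only; PA's actual code is not shown in
this thread:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Minimal soft-reference cache: the GC may reclaim entries under
// memory pressure instead of the cache pinning them on the heap.
public class SoftCache {
    private final Map<String, SoftReference<Object>> map =
            new HashMap<String, SoftReference<Object>>();

    public void put(String key, Object value) {
        map.put(key, new SoftReference<Object>(value));
    }

    // Returns null if the key was never cached, or if the GC has
    // already reclaimed the referent.
    public Object get(String key) {
        SoftReference<Object> ref = map.get(key);
        return (ref == null) ? null : ref.get();
    }
}
```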

> or if you have a pessimal usage pattern that exacerbates the situation.

Ummmm...?!? You lost me here... What's a "pessimal usage pattern"?

> if you can come up with a scheme to run Lucene indexing with modified
> code for keeping track of file resources.

Sure, there are many things one could do... However, I have to balance
how much time I want to invest in any one of those alleys. One thing I
really like about Lucene is its very simple API and usage. So far it has
worked out pretty well for me, as I'm using it pretty extensively. And I
seem to have found -at last- a good balance between the different
constraints I'm operating under.

> an anomalous situation (use on a client/desktop machine)

"Anomalous situation"?!?! Ummm... Lucene is just an API... Hopefully
it's not bundled with some "dogma" attached to it... However, I'm kind
of starting to wonder about that, considering some of the -very
defensive- responses I got to my postings... Oh, well... I will just go
back to my cave... :-(

> could configure lucene to be careful about how many files it keeps open
> at any given time.

That would be great! On a somewhat related note, I have decided to stick
with the com.lucene package for the time being... I was pretty excited
when the rc stuff came out, but it just didn't work out for me. My
resources problem just went from bad to worse. Also, I have two issues
with the release candidate: locking and reference counting.

Locking. I don't have anything against locking per se. However, I
really don't like how it's implemented in the rc. Using files just does
not work for me. It creates too many problems when something goes wrong
(eg the app is killed without warning and I have to clean up all those
locks by myself). What about using sockets or something to rendez-vous
on an index? Or, at a bare minimum, being able to disable the locking
altogether. I understand that most people are using Lucene under a very
different setup than I do, but nevertheless it should not hurt to make
it configurable. Anyway, it does not work for older JVMs, as noted in
the source code. Last, but not least, I always get very scared when I
see some platform-dependent code somewhere (eg "if version 1 then ") ;-)
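The manual cleanup PA describes could be sketched as a startup step,
assuming the rc keeps its locks as "*.lock" files inside the index
directory (the file-name convention here is an assumption) and that no
other process can be using the index:

```java
import java.io.File;

// Startup-time removal of stale Lucene lock files left behind by a
// killed process. Only safe when no other process is touching the
// index; the "*.lock" naming is an assumption about the rc.
public class LockCleaner {
    // Deletes every "*.lock" file in indexDir; returns how many
    // were removed.
    public static int clean(File indexDir) {
        File[] files = indexDir.listFiles();
        if (files == null) {
            return 0; // not a directory (or I/O error)
        }
        int removed = 0;
        for (File f : files) {
            if (f.getName().endsWith(".lock") && f.delete()) {
                removed++;
            }
        }
        return removed;
    }
}
```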

Reference counting. Well, as noted in a comment in the source code, the
reference API is really the way to go... And trying to be backward
compatible with version 0.9 is somehow missing the forest for the trees...
Just my two cents in any case. And yes, I'm well aware that I could fix
all these issues myself... And start contributing to Lucene instead
of just ranting left and right... But also keep in mind that I'm just a
humble Lucene user. And there seems to be a very clear distinction
between "user" and "developer" in Lucene's world... ;-)

Thanks for your response in any case. I hope I didn't "offend" too many
people with my ramblings ;-)

PA.


Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
Just a couple of clarification points:
- the number of files that Lucene uses depends on the number of segments
in the index and the number of *stored* fields
- if your fields are not stored but only indexed, they do not require
separate files. Otherwise, an .fnn file is created for each field.
- if at least one document uses a given field name in an index, that
index requires the .fnn file for that field
- index segments are created when documents are added to the index. For
each 10 docs you get a new segment.
- optimizing the index removes all segments and replaces them with one
new segment that contains all of the documents
- optimization is done periodically as more documents are added
(controlled by IndexWriter.mergeFactor), but can be done manually
whenever needed
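Those bullets suggest a back-of-envelope model for the open-file
pressure. The sketch below is my own arithmetic built on them; the
constant for the fixed per-segment files is an illustrative assumption,
not a number from this thread:

```java
// Back-of-envelope file-count model from the rules above: one
// segment per 10 added docs (before any merging), and each segment
// needs a fixed set of core files plus one .fnn file per stored
// field. CORE_FILES_PER_SEGMENT = 7 is an assumption for
// illustration, not a number from the original post.
public class FileCountModel {
    static final int CORE_FILES_PER_SEGMENT = 7; // assumed constant

    public static int segments(int numDocs) {
        return (numDocs + 9) / 10; // ceil(numDocs / 10)
    }

    public static int filesPerSegment(int storedFields) {
        return CORE_FILES_PER_SEGMENT + storedFields;
    }

    public static int totalFiles(int numDocs, int storedFields) {
        return segments(numDocs) * filesPerSegment(storedFields);
    }
}
```

Under this model, both optimizing (collapsing to one segment) and
keeping fields unstored cut the file count sharply.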

With all this, I think Lucene does use too many files...
Some additional info: there is a field on IndexWriter called infoStream.
If this is set to a PrintStream (such as System.out), various diagnostic
messages about the merging process will be printed to that stream. You
might find this helpful in tuning the merge parameters.
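Concretely, since infoStream is a public field on this era's
IndexWriter, wiring it up is a one-liner. A sketch, not tested; `dir`
and `analyzer` stand for whatever your app already uses:

```java
// Route merge diagnostics to stdout while tuning.
IndexWriter writer = new IndexWriter(dir, analyzer, false);
writer.infoStream = System.out; // prints merge activity
```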

Hope this helps.
Good luck.

Dmitry.




Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
On Wednesday, May 1, 2002, at 12:41 AM, Dmitry Serebrennikov wrote:

> - the number of files that Lucene uses depends on the number of
> segments in the index and the number of *stored* fields
> - if your fields are not stored but only indexed, they do not require
> separate files. Otherwise, an .fnn file is created for each field.

Ok. That's good as all my fields are indexed but not stored in Lucene.
Only one field is stored in any one index: the uuid of an object (as a
Keyword).

> - if at least one document uses a given field name in an index, that
> index requires the .fnn file for that field

Ok. So, in theory, more homogeneous indexes should use fewer files, all
things being equal?

> - index segments are created when documents are added to the index. For
> each 10 docs you get a new segment.
> - optimizing the index removes all segments and replaces them with one
> new segment that contains all of the documents
> - optimization is done periodically as more documents are added
> (controlled by IndexWriter.mergeFactor), but can be done manually
> whenever needed

Ok. When doing the optimization, are there any temporary files getting
created?

> With all this, I think Lucene does use too many files...

That's my impression also...

> Some additional info: there is a field on IndexWriter called
> infoStream. If this is set to a PrintStream (such as System.out),
> various diagnostic messages about the merging process will be printed
> to that stream.

Yep. I guess I overlooked that.

> You might find this helpful in tuning the merge parameters.

Just to make sure: will using a small merge factor (eg 2) reduce the
number of files, or just optimize (aka merge) the index more often?

> Hope this helps.
> Good luck.

Thanks. Very helpful indeed :-)

R.



Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
Subject: Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
From: petite_abeille <petite_abeille@mac.com>
Date: Wed, 1 May 2002 08:37:51 +0200
To: "Lucene Users List" <lucene-user@jakarta.apache.org>


> On Wednesday, May 1, 2002, at 12:41 AM, Dmitry Serebrennikov wrote:
>
>> - the number of files that Lucene uses depends on the number of
>> segments in the index and the number of *stored* fields
>> - if your fields are not stored but only indexed, they do not
>> require separate files. Otherwise, an .fnn file is created for
>> each field.
>
> Ok. That's good as all my fields are indexed but not stored in
> Lucene. Only one field is stored in any one index: the uuid of an
> object (as a Keyword).
>
>> - if at least one document uses a given field name in an index,
>> that index requires the .fnn file for that field
>
> Ok. So, in theory, more homogeneous indexes should use fewer files,
> all things being equal?

I think so... I guess you have many kinds of documents that have some
fields in common and some unique? Yes, then having the same kinds of
documents in a given index will reduce the total number of files.
Personally, I don't have experience with this since all of my documents
have the same fields.

>> - index segments are created when documents are added to the
>> index. For each 10 docs you get a new segment.
>> - optimizing the index removes all segments and replaces them with
>> one new segment that contains all of the documents
>> - optimization is done periodically as more documents are added
>> (controlled by IndexWriter.mergeFactor), but can be done manually
>> whenever needed
>
> Ok. When doing the optimization, are there any temporary files
> getting created?

Nope, just the files for the new segment. Well, I think the segments and
deleted files might have "segments.new" and "deleted.new" while they are
being modified, with the old ones removed and new ones renamed afterwards.

>> Some additional info: there is a field on IndexWriter called
>> infoStream. If this is set to a PrintStream (such as System.out),
>> various diagnostic messages about the merging process will be
>> printed to that stream.
>
> Yep. I guess I overlooked that.
>
>> You might find this helpful in tuning the merge parameters.
>
> Just to make sure: will using a small merge factor (eg 2) reduce the
> number of files, or just optimize (aka merge) the index more often?

It will optimize more often and, since optimization replaces all
segments with one, the number of files will drop. However, the old files
will stay around until they are no longer in use by pre-existing
IndexReader instances, so that may be another catch.
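Dmitry's answer can be sanity-checked with a toy simulation of the merge
behavior as described earlier in the thread: one new 10-doc segment per
10 added docs, and a merge whenever mergeFactor equal-sized segments
accumulate. This is my reading of the described behavior, not Lucene
source:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of the era's merge policy: every 10 added docs start a
// new segment; whenever `mergeFactor` segments of equal size pile
// up, merge them into one.
public class MergeModel {
    // Returns how many segments remain after adding numDocs docs.
    public static int segmentsAfter(int numDocs, int mergeFactor) {
        List<Integer> sizes = new ArrayList<Integer>(); // docs per segment
        for (int added = 10; added <= numDocs; added += 10) {
            sizes.add(10);
            boolean merged = true;
            while (merged) {
                merged = false;
                Collections.sort(sizes);
                for (int i = 0; i + mergeFactor <= sizes.size(); i++) {
                    // length of the run of equal sizes starting at i
                    int j = i;
                    while (j < sizes.size()
                            && sizes.get(j).equals(sizes.get(i))) j++;
                    if (j - i >= mergeFactor) {
                        int size = sizes.get(i);
                        for (int k = 0; k < mergeFactor; k++) sizes.remove(i);
                        sizes.add(size * mergeFactor); // the merged segment
                        merged = true;
                        break;
                    }
                }
            }
        }
        return sizes.size();
    }
}
```

In this model a mergeFactor of 2 keeps the index at a handful of
segments (and therefore files), at the cost of merging far more often,
which matches the tradeoff Dmitry describes.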

Dmitry.



Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)
On Wednesday, May 1, 2002, at 10:33 PM, Dmitry Serebrennikov wrote:

> I think so... I guess you have many kinds of documents that have some
> fields in common and some unique?

Yes. I'm using Lucene as the backbone for a kind of OODBMS, so it can
index any type of object, and those objects vary greatly in complexity.

> Nope, just the files for the new segment. Well, I think the segments
> and deleted files might have "segments.new" and "deleted.new" while
> they are being modified, with the old ones removed and new ones renamed
> afterwards.

Good.

> It will optimize more often and, since optimization replaces all
> segments with one, the number of files will drop. However, the old
> files will stay around until they are no longer in use by pre-existing
> IndexReader instances, so that may be another catch.

Ok, seems to be consistent with what I'm seeing.

Thanks for your explanation.

PA.

