Mailing List Archive

CachedSearcher
Hello!
Many people have asked for code that caches open Searcher objects for as long as the index is not modified. The first version was written by Scott Ganyo and submitted to the list as IndexAccessControl.

Now I've decoupled the logic that is needed to manage searchers.

The usage is very simple:
IndexSearcherCache isc = new IndexSearcherCache(new File("/path/to/the/index"));
for (int i = 0; i < 100; i++) {
    Searcher searcher = isc.getSearcher();
    // search here
    searcher.close();
}

Only one Searcher is opened here if no other thread is writing to the index; if the index has been modified, getSearcher() closes the old Searcher and creates a new one.


Unfortunately, to compile and use this code you have to modify the Lucene source:

1. Change all package-private abstract methods in Searcher.java to public.

Before:

/** Frees resources associated with this Searcher. */
abstract public void close() throws IOException;

abstract int docFreq(Term term) throws IOException;
abstract int maxDoc() throws IOException;
abstract TopDocs search(Query query, Filter filter, int n)
    throws IOException;

After:

/** Frees resources associated with this Searcher. */
public abstract void close() throws IOException;

public abstract int docFreq(Term term) throws IOException;
public abstract int maxDoc() throws IOException;
public abstract TopDocs search(Query query, Filter filter, int n)
    throws IOException;

2. Change the package-private class TopDocs to public (in TopDocs.java):
final class TopDocs { --> public final class TopDocs {


Or you can use the modified files I've attached.

I hope this code is helpful.

The main idea is to have an interface, SearcherSource, similar to DataSource in javax.sql. A SearcherSource is responsible for creating searcher objects. One implementation is SearcherCache, which encapsulates the logic of caching a searcher. IndexSearcherCache, as you might guess, caches IndexSearcher objects. Someone could implement a MultiSearcherCache class that manages... (recreates the searcher if one of the underlying searchers needs reopening).

I create an IndexSearcherCache in my init method and pass the object as a SearcherSource to the working methods. In the destroy process I call the release() method. This way I can later change the implementation of the cache, as long as it implements SearcherSource.
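
The design described above can be sketched without any dependency on Lucene. The names SearcherSource, SearcherCache, getSearcher() and release() come from the post; everything else (the generic type parameter and the indexVersion, opener and closer hooks) is an assumption made for illustration, and the attached code may well differ.

```java
import java.util.function.Consumer;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hands out a searcher-like object; callers do not care how it is created. */
interface SearcherSource<S> {
    S getSearcher();
    void release();
}

/** Caches one instance and reopens it when the index version changes. */
class SearcherCache<S> implements SearcherSource<S> {
    private final LongSupplier indexVersion; // hypothetical hook, e.g. a last-modified time
    private final Supplier<S> opener;        // hypothetical hook that opens a new searcher
    private final Consumer<S> closer;        // hypothetical hook that closes a searcher
    private S current;
    private long openedAtVersion;

    SearcherCache(LongSupplier indexVersion, Supplier<S> opener, Consumer<S> closer) {
        this.indexVersion = indexVersion;
        this.opener = opener;
        this.closer = closer;
    }

    @Override
    public synchronized S getSearcher() {
        long version = indexVersion.getAsLong();
        if (current == null || version != openedAtVersion) {
            if (current != null) {
                closer.accept(current);      // index changed: drop the stale searcher
            }
            current = opener.get();
            openedAtVersion = version;
        }
        return current;
    }

    @Override
    public synchronized void release() {
        if (current != null) {
            closer.accept(current);
            current = null;
        }
    }
}
```

In this sketch the cache owns the searcher, so callers must not close it themselves. It also closes the stale searcher immediately, which is only safe if no other thread is still using it.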

peter

PS: of course you can change the code and the class/method/package/... names.
Unfortunately, the code contains a lot of System.out.println debugging output, but it is very helpful for understanding the behaviour.
Re: CachedSearcher [ In reply to ]
Halácsy Péter wrote:
> A lot of people requested a code to cache opened Searcher objects until the index is not modified. The first version of this was written by Scott Ganyo and submitted as IndexAccessControl to the list.
>
> Now I've decoupled the logic that is needed to manage searchers.
>
> Unfortunately to compile and use this code one has to modify the lucene source:

Why is this more complicated than the code in demo/Search.jhtml
(included below)? FSDirectory closes files as they're GC'd, so you
don't have to explicitly close the IndexReaders or Searchers.

Doug

/** Keep a cache of open IndexReaders, so that an index does not
 * have to be opened for each query. The cache re-opens an index when
 * it has changed so that additions and deletions are visible ASAP.
 */

static Hashtable indexCache = new Hashtable();    // name->CachedIndex

class CachedIndex {                               // a cache entry
  IndexReader reader;                             // an open reader
  long modified;                                  // reader's mod. date

  CachedIndex(String name) throws IOException {
    modified = IndexReader.lastModified(name);    // get mod. date
    reader = IndexReader.open(name);              // open reader
  }
}

IndexReader getReader(String name) throws ServletException {
  CachedIndex index =                             // look in cache
    (CachedIndex)indexCache.get(name);

  try {
    if (index != null &&                          // check up-to-date
        (index.modified == IndexReader.lastModified(name)))
      return index.reader;                        // cache hit
    else {
      index = new CachedIndex(name);              // cache miss
    }
  } catch (IOException e) {
    StringWriter writer = new StringWriter();
    PrintWriter pw = new PrintWriter(writer);
    throw new ServletException("Could not open index " + name + ": " +
                               e.getClass().getName() + "--" +
                               e.getMessage());
  }

  indexCache.put(name, index);                    // add to cache
  return index.reader;
}


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: CachedSearcher [ In reply to ]
>FSDirectory closes files as they're GC'd, so you
>don't have to explicitly close the IndexReaders or Searchers.
>
>Doug
>

hmmm...is this documented somewhere? I go through quite a bit of trouble
just to close Searchers (because Hits become invalid when the Searcher is
closed).

If the object has a close() method with public modifier, isn't it a common
idiom that client code needs to invoke close() explicitly? If there's no
real need to call close, maybe it can be changed to protected?

Regards,
Kelvin


Re: CachedSearcher [ In reply to ]
On Monday, July 15, 2002, at 10:19 PM, Kelvin Tan wrote:

>> FSDirectory closes files as they're GC'd, so you
>> don't have to explicitly close the IndexReaders or Searchers.
>>
>> Doug
>>
>
> hmmm...is this documented somewhere? I go through quite a bit of trouble
> just to close Searchers (because Hits become invalid when the
> Searcher is
> closed).
>
> If the object has a close() method with public modifier, isn't it a
> common
> idiom that client code needs to invoke close() explicitly?

I absolutely agree. If letting it get GC'ed is fine, then just about
any other name, like "dispose" might be better.

> If there's no
> real need to call close, maybe it can be changed to protected?

I wouldn't go that far.

~ David Smiley


RE: CachedSearcher [ In reply to ]
> -----Original Message-----
> From: Doug Cutting [mailto:cutting@lucene.com]
> Sent: Tuesday, July 16, 2002 1:00 AM
> To: Lucene Users List
> Subject: Re: CachedSearcher
>
>
> Why is this more complicated than the code in demo/Search.jhtml
> (included below)? FSDirectory closes files as they're GC'd, so you
> don't have to explicitly close the IndexReaders or Searchers.

I'll check this code, but I think it could hang with a lot of open IndexReaders:
http://developer.java.sun.com/developer/TechTips/2000/tt0124.html

(That is, if many searchers are requested while a writer is constantly modifying the index.)

peter

Re: CachedSearcher [ In reply to ]
I am new here, so I am sorry if this question has been asked before. Why are there so many final and package-private methods? I want to change the way TermQuery computes scores. Ideally, I would like to write subclasses of TermQuery and TermScorer and place them in my OWN package. Currently, I have to put these two classes in a Lucene package, and I have to copy almost every line of the TermQuery class into my new query class, except the line that returns the Scorer. This may be a bad example, but I would still like to know whether we can make Lucene more extensible from outside in the future.
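
The kind of extension point being asked for can be sketched as a protected factory method that a subclass overrides, so the rest of the query class does not have to be copied. All names below (Scorer, BasicTermQuery, LinearTermQuery, createScorer, scoreFor) are hypothetical stand-ins and not Lucene's classes or scoring formula.

```java
/** Hypothetical stand-in for a scorer; not Lucene's Scorer class. */
interface Scorer {
    float score(int termFreq);
}

/** Hypothetical query class that exposes one overridable factory method. */
class BasicTermQuery {
    private final String term;

    public BasicTermQuery(String term) {
        this.term = term;
    }

    public String term() {
        return term;
    }

    /** Extension point: a subclass overrides only this method. */
    protected Scorer createScorer() {
        return tf -> (float) Math.sqrt(tf);
    }

    /** Shared logic stays here, final, and is never copied. */
    public final float scoreFor(int termFreq) {
        return createScorer().score(termFreq);
    }
}

/** A subclass like this could live in the user's own package. */
class LinearTermQuery extends BasicTermQuery {
    public LinearTermQuery(String term) {
        super(term);
    }

    @Override
    protected Scorer createScorer() {
        return tf -> (float) tf;
    }
}
```

For the subclass to compile from another package, the base class and its constructor would have to be public; they are kept in one file here only so the sketch is self-contained.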


RE: CachedSearcher [ In reply to ]
I'd like to see the finalize() methods removed from Lucene entirely. In a
system with heavy load and lots of gc, using finalize() causes problems. To
wit:

1) I was at a talk at JavaOne last year where the gc performance experts
from Sun (the engineers actually writing the HotSpot gc) were giving
performance advice. They specifically stated that finalize() should be
avoided if at all possible because the following steps have to happen for
finalized objects:
a) register the object when created
b) notice the object when it becomes unreachable
c) finalize the object
d) notice the object when it becomes unreachable (again)
e) reclaim the object

This leads to the following effects in the vm:
a) allocation is slower
b) heap is bigger
c) gc pauses are longer

The Sun engineers recommended that if you really do need an automatic clean
up process, that Weak references are *much* more efficient and should be
used in preference to finalize().

2) External resources (i.e. file handles) are not released until the reader
is closed. And, as many have found, Lucene eats file handles for breakfast,
lunch, and dinner.
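
As a rough illustration of cleanup without finalize(), the sketch below uses java.lang.ref.Cleaner (Java 9 and later), which did not exist when this thread was written and is built on phantom references rather than the weak references mentioned in the talk. TrackedHandle and its AtomicBoolean flag are hypothetical; the flag stands in for closing a real file handle.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical resource wrapper: explicit close(), with a Cleaner as a safety net. */
class TrackedHandle implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    /** The cleanup action must not hold a reference to the TrackedHandle itself. */
    private static final class Release implements Runnable {
        private final AtomicBoolean released;

        Release(AtomicBoolean released) {
            this.released = released;
        }

        @Override
        public void run() {
            released.set(true); // stand-in for closing a file handle
        }
    }

    private final Cleaner.Cleanable cleanable;

    TrackedHandle(AtomicBoolean released) {
        this.cleanable = CLEANER.register(this, new Release(released));
    }

    /** Runs the cleanup action now; the Cleaner guarantees it runs at most once. */
    @Override
    public void close() {
        cleanable.clean();
    }
}
```

If close() is never called, the action runs some time after the object becomes unreachable, which is still subject to the same timing uncertainty as any GC-driven cleanup.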

Scott

Re: CachedSearcher [ In reply to ]
Kelvin Tan wrote:
> If the object has a close() method with public modifier, isn't it a common
> idiom that client code needs to invoke close() explicitly? If there's no
> real need to call close, maybe it can be changed to protected?

Yes, that is a common idiom. In the case of Lucene's FSDirectory, it's still a
good idea to close it when you know it's no longer needed, to minimize the
number of open files, but sometimes it is difficult to know when it is no
longer needed. Finalizers are intended for precisely this purpose. But you're
right, probably this should be better documented.

Doug


Re: CachedSearcher [ In reply to ]
Scott Ganyo wrote:
> I'd like to see the finalize() methods removed from Lucene entirely. In a
> system with heavy load and lots of gc, using finalize() causes problems.
> [ ... ]
> External resources (i.e. file handles) are not released until the reader
> is closed. And, as many have found, Lucene eats file handles for breakfast,
> lunch, and dinner.

Lucene does open and close lots of files relative to many other applications,
but the number of files opened is still many orders of magnitude less than the
number of other objects allocated. I would be very surprised if finalizers for
the hundreds of files that Lucene might open in a session would have any
measurable impact on garbage collector performance given the millions of other
objects that the garbage collector might process in that session.

As usual, one should not make performance claims without performing benchmarks.
It would be a simple matter to comment out the finalize() methods, recompile
and compare indexing and search speed. If the improvement is significant, then
we can consider removing finalize methods.

Doug


RE: CachedSearcher [ In reply to ]
Point taken. Indeed, these were general recommendations that may/may not
have a strong impact on Lucene's specific use of finalization. My only
specific performance claim is that there will be a negative impact of some
degree using finalizers. Whether that impact is noticeable or not will
probably depend upon a number of factors. So I will avoid making any
further judgements on the impact of finalization in Lucene on the
performance until I have proof.

Benchmarks aside, my point on the file handles is something that hit us
square between the eyes. Before we started caching and explicitly closing
our Searchers we would regularly run out of file handles because of Lucene.
This was despite increasing our allocated file handles to ludicrous levels
in the OS. I would recommend that, in general, Java developers would be
well advised to explicitly release external resources when done with them
rather than allowing finalization to take care of it.
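
A minimal sketch of explicit release, using try-with-resources (Java 7 and later; in the Java of this thread's era the equivalent is a try/finally block that calls close()). LoggedResource and ExplicitRelease are hypothetical names, and the log list only records the order of events.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical resource that records when it is opened, used and closed. */
class LoggedResource implements AutoCloseable {
    private final String name;
    private final List<String> log;

    LoggedResource(String name, List<String> log) {
        this.name = name;
        this.log = log;
        log.add("open " + name);
    }

    void use() {
        log.add("use " + name);
    }

    @Override
    public void close() {
        log.add("close " + name);
    }
}

class ExplicitRelease {
    /** Opens, uses and closes one resource, returning the recorded events. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (LoggedResource searcher = new LoggedResource("searcher", log)) {
            searcher.use();
        } // close() is called here, even if use() throws
        return log;
    }
}
```

The resource is released at a known point in the program rather than whenever the garbage collector happens to run.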

Scott

Re: CachedSearcher [ In reply to ]
Hang Li wrote:
> Why are there so many final and package-protected methods?

The package private stuff was motivated by Javadoc. When I wrote Lucene I
wanted the Javadoc to make it easy to use. Thus I did not want the Javadoc
cluttered with lots of methods that 99% of users did not need to know about.

So a problem is how to distinguish methods that are meant for end users from
those that only may rarely be needed by an expert developer. Perhaps we could
establish a Javadoc convention for those methods that most users don't need to
know about. For example, their documentation could begin "Expert:" or
something. What do folks think of that?
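
To make the proposal concrete, here is how such a comment might look. ExampleSearcher and its methods are hypothetical and stubbed; only the "Expert:" prefix comes from the suggestion above.

```java
/** Hypothetical class illustrating the proposed Javadoc convention. */
class ExampleSearcher {
    /** Returns the number of documents matching a term (stubbed for this sketch). */
    public int count(String term) {
        return term.isEmpty() ? 0 : 1;
    }

    /**
     * Expert: returns the raw document frequency used internally for scoring.
     * Most applications should call {@link #count(String)} instead.
     */
    public int docFreq(String term) {
        return count(term);
    }
}
```

The method stays public, so expert users can call it from their own packages, while the first word of its documentation tells everyone else they can skip it.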

Also, many package private methods really are internal methods that are not
designed to be called outside of the implementation. Trying to override them
probably won't work. When stuff that is tricky to use is documented and easy
to use, folks will use it, it won't work, and they'll complain, wasting
everyone's time. So we must be careful about what is made public. I would
rather err on the side of exposing less than more--folks who know what they're
doing can always add code into a lucene package. It's not ideal, but it works.

Some 'final' declarations made a performance difference when javac did
inlining, but no longer do, and should probably be removed now. Some still
keep people from subclassing things that are not designed to be subclassed. So
these should also be considered on a case-by-case basis.

> I want to change the way TermQuery computes scores.

Could you please make a proposal to the lucene-dev list of which methods and
classes should be made public or protected or non-final, and what documentation
should be added?

Thanks,

Doug


Re: CachedSearcher [ In reply to ]
On Tue, 16 Jul 2002, Doug Cutting wrote:

> Hang Li wrote:
> > Why are there so many final and package-protected methods?
>
> The package private stuff was motivated by Javadoc. When I wrote
> Lucene I wanted the Javadoc to make it easy to use. Thus I did not
> want the Javadoc cluttered with lots of methods that 99% of users did
> not need to know about.
>
> So a problem is how to distinguish methods that are meant for end
> users from those that only may rarely be needed by an expert
> developer. Perhaps we could establish a Javadoc convention for those
> methods that most users don't need to know about. For example, their
> documentation could begin "Expert:" or something. What do folks think
> of that?

I think that this is a good idea, which perhaps ought to be combined with
a note on the top level of the documentation which would read something
like "Methods marked in the documentation with 'Expert' should only be
used by experienced users of Lucene; /caveat coder/."

> Also, many package private methods really are internal methods that
> are not designed to be called outside of the implementation. Trying
> to override them probably won't work.

Certainly things like this should be left (package) private.

> When stuff that is tricky to use is documented and easy to use, folks
> will use it, it won't work, and they'll complain, wasting everyone's
> time. So we must be careful about what is made public. I would
> rather err on the side of exposing less than more--folks who know what
> they're doing can always add code into a lucene package. It's not
> ideal, but it works.

This seems reasonable enough for most purposes. I do wonder, though,
whether there's a "gotcha" that can arise as an unexpected side effect of
including a file in a package (for this purpose) that otherwise wouldn't
need to be included. (I don't use packages much myself; can a piece of
code be part of more than one package?) If nothing else, such inclusion
might be somewhat mysterious to later maintainers of that code. This kind
of modification might also make it more difficult for people to get Lucene
contributions from more than one source to work together.

Regards,

Joshua O'Madadhain

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.








RE: CachedSearcher [ In reply to ]
> -----Original Message-----
> From: Doug Cutting [mailto:cutting@lucene.com]
> Sent: Tuesday, July 16, 2002 6:44 PM
> To: Lucene Users List
> Subject: Re: CachedSearcher
>
>
> Kelvin Tan wrote:
> > If the object has a close() method with public modifier,
> isn't it a common
> > idiom that client code needs to invoke close() explicitly?
> If there's no
> > real need to call close, maybe it can be changed to protected?
>
> Yes, that is a common idiom. In the case of Lucene's
> FSDirectory, it's still a
> good idea to close it when you know it's no longer needed, to
> minimize the
> number of open files, but sometimes it is difficult to know
> when it is no
> longer needed. Finalizers are intended for precisely this
> purpose. But you're
> right, probably this should be better documented.
>
> Doug
>
>

Doug!
I made an IndexReaderCache class from the code you sent (the code in demo/Search.jhtml).
But this causes an exception:
IndexSearcher searcher = new IndexSearcher(cache.getReader("/data/index"));
searcher.close();


searcher = new IndexSearcher(cache.getReader("/data/index"));
searcher.search(aQuery);

When I call the close method, the searcher closes the IndexReader, but the cache (your getReader method) then returns the closed reader again.

That's why I made a subclass of Searcher that can be closed when the user doesn't want to use it any more.

You wrote: "sometimes it is difficult to know when it is no longer needed"

I think: "use a cache and you don't have to know when it is no longer needed!" ;)
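
One way to let callers close what they borrow without closing the shared reader is reference counting. This is a different technique from the Searcher subclass mentioned above, shown only as a sketch; SharedReader, Handle, acquire() and retire() are hypothetical names, and a boolean flag stands in for actually closing files.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical shared resource standing in for a cached IndexReader. */
class SharedReader {
    private final AtomicInteger refs = new AtomicInteger(1); // the cache's own reference
    private volatile boolean closed;

    /** Borrows the reader; the caller closes the returned handle when done. */
    Handle acquire() {
        refs.incrementAndGet();
        return new Handle();
    }

    /** Called by the cache when it stops handing out this reader. */
    void retire() {
        releaseOne();
    }

    boolean isClosed() {
        return closed;
    }

    private void releaseOne() {
        if (refs.decrementAndGet() == 0) {
            closed = true; // real code would close the underlying files here
        }
    }

    /** Closing a handle gives back only the caller's own reference. */
    final class Handle implements AutoCloseable {
        private boolean done;

        @Override
        public synchronized void close() {
            if (!done) {
                done = true;
                releaseOne();
            }
        }
    }
}
```

The underlying reader is closed only after the cache has retired it and every borrower has closed its handle. The sketch assumes the cache never calls acquire() after retire().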

peter
RE: CachedSearcher [ In reply to ]
> -----Original Message-----
> From: Doug Cutting [mailto:cutting@lucene.com]
> Sent: Tuesday, July 16, 2002 6:56 PM
> To: Lucene Users List
> Subject: Re: CachedSearcher
>
>
> I would be very surprised
> if finalizers for
> the hundreds of files that Lucene might open in a session
> would have any
> measurable impact on garbage collector performance given the
> millions of other
> objects that the garbage collector might process in that session.
>

I think you are right: using a finalize method to release a resource has no measurable impact on garbage collector performance.

But if the JVM has no time to run the garbage collector, then the finalize method won't be called --> "too many open files" exception (this is a limit of the OS, not of the JVM).

I'm not sure that every JVM's garbage collector runs as soon as possible.

peter

Re: CachedSearcher [ In reply to ]
On Tuesday, July 16, 2002, at 01:23 PM, Scott Ganyo wrote:

> Point taken. Indeed, these were general recommendations that
> may/may not
> have a strong impact on Lucene's specific use of finalization. My only
> specific performance claim is that there will be a negative impact
> of some
> degree using finalizers. Whether that impact is noticeable or not will
> probably depend upon a number of factors. So I will avoid making any
> further judgements on the impact of finalization in Lucene on the
> performance until I have proof.
>
> Benchmarks aside, my point on the file handles is something that hit us
> square between the eyes. Before we started caching and explicitly
> closing
> our Searchers we would regularly run out of file handles because of
> Lucene.
> This was despite increasing our allocated file handles to ludicrous
> levels
> in the OS. I would recommend that, in general, Java developers
> would be
> well advised to explicitly release external resources when done with
> them
> rather than allowing finalization to take care of it.
>
> Scott
>

Ahh, I take back my last comment about renaming close() to
dispose(). If the IndexReader simply had a bunch of in-memory data,
then dispose() would be appropriate. If it holds onto resources
outside of the VM (typical examples are Window objects, file streams,
network sockets, etc.), then close() should be one of those mandatory
methods to be invoked when done with it. In general one should *not*
/rely/ on the GC to clean up external resources. That's an important
lesson repeated in various articles and books and testimonials I've
learned over years of Java development.

This might clear up the issues some people have been having with not
having enough file handles available on their OS.

~ David Smiley


Re: CachedSearcher [ In reply to ]
Halácsy Péter wrote:
> I made an IndexReaderCache class from the code you have sent (the code in demo/Search.jhtml).
> But this causes exception:
> IndexSearcher searcher = new IndexSearcher(cache.getReader("/data/index"));
> searcher.close();
>
>
> searcher = new IndexSearcher(cache.getReader("/data/index"));
> searcher.search(aQuery);
>
> when I call the close method the searcher closes the indexreader

You don't need to close the searcher. If you don't close it, you won't
have this problem. Finalizers will close the open files.

Doug


