Mailing List Archive: Moving Index from Crawl/Build Server to Search Server

Jan 30, 2002, 5:00 PM

Post #1 of 3 (797 views)

I am working on a browser-based search application that crawls web and file documents. I would like to do my crawling and index building on one server and my searching on one or more other servers. I want maximum up time and performance on my search servers. What is the best way to move the index from the build server to the search servers and then change which index a user is searching against? I am concerned about switching the index while a user is paging through search results. Ideally new users will access the new index while current users will access the old index. How will I know when all current users are no longer accessing the old index so that it can be deleted?

Thanks,

Mark Tucker

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: Moving Index from Crawl/Build Server to Search Server [ In reply to ]

ykingma at xs4all

Jan 31, 2002, 2:36 AM

Post #2 of 3 (766 views)

Mark,

>I am working on a browser-based search application that crawls web and file documents. I would like to do my crawling and index building on one server and my searching on one or more other servers. I want maximum up time and performance on my search servers. What is the best way to move the index from the build server to the search servers and then change which index a user is searching against? I am concerned about switching the index while a user is paging through search results. Ideally new users will access the new index while current users will access the old index. How will I know when all current users are no longer accessing the old index so that it can be deleted?

Use separate disks for the old and the new index.
Write a class to switch between the two for searching; see
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00133.html
for the requirements Lucene imposes.
Every once in a while you'll have to check whether the new index is
available, e.g. by inspecting a file that names the preferred index to use.
When it's time to change over, direct all new users of your index reader to the
new index and wait for all users of the old index to finish their work.
Let the administrator know when all old users have finished.

Have a look at Doug Lea's util.concurrent library. The writer-preference
reader/writer lock might be handy here, with the writer doing the changeover.
This effectively waits for all old users to finish. When you want old
and new users working together, you'll need something more complex.
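
One possible shape for the "something more complex" case, where old and new users overlap, is a per-index use count. The sketch below is only an illustration: the class IndexSwitcher, its methods, and the onDrained callback are hypothetical names, and R stands in for whatever is handed to searches (for example a Lucene IndexReader). It uses only the Java standard library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical switcher: new searches get the current index, old ones drain. */
final class IndexSwitcher<R> {
    private R current;
    private final Map<R, Integer> users = new HashMap<>(); // searches in flight per index
    private final Consumer<R> onDrained;                   // e.g. notify the administrator

    IndexSwitcher(R initial, Consumer<R> onDrained) {
        this.current = initial;
        this.onDrained = onDrained;
    }

    /** Call when a user starts searching; pair every call with release(). */
    synchronized R acquire() {
        users.merge(current, 1, Integer::sum);
        return current;
    }

    /** Call when that user is done with the index returned by acquire(). */
    synchronized void release(R index) {
        Integer n = users.get(index);
        if (n == null) throw new IllegalStateException("release without acquire");
        if (n > 1) {
            users.put(index, n - 1);
            return;
        }
        users.remove(index);
        // The last user of a replaced index has finished: it can be deleted now.
        if (!index.equals(current)) onDrained.accept(index);
    }

    /** Directs all new users to next; the previous index drains on its own. */
    synchronized void switchTo(R next) {
        R old = current;
        current = next;
        if (!old.equals(next) && !users.containsKey(old)) onDrained.accept(old);
    }
}
```

The callback fires exactly once per replaced index, either at the switch (if nobody was using it) or when its last user releases it. What counts as a "user" (one query, or one paging session) is left to the caller.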

For maximum performance you'll need to limit the number of concurrent
users of the index (indices) anyway.

To move a new index to the search servers, use a low-priority copy,
and possibly more than one CPU.
Lucene has some facilities to merge indices, which allow you to copy
only the newer parts of an index and then merge locally. This does not
delete old Lucene documents, though.
Once you've changed over, you can also update the old index and use
it for more performance, i.e. you can have multiple entries in
the file with the preferred index.
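
As a rough sketch of the "file with the preferred index" idea, the following reads such a file with one index location per line. The file format (one path per line, blank lines and lines starting with # ignored) and the class name PreferredIndexFile are assumptions made for this example, not anything Lucene defines.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical reader for a file listing the index locations to search. */
final class PreferredIndexFile {
    /** Returns the listed index locations in file order. */
    static List<String> read(Path file) throws IOException {
        return Files.readAllLines(file).stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .collect(Collectors.toList());
    }
}
```

A search server could call read() periodically and switch over when the first entry changes.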

As always, lots of choices. Try the simple ones before buying
a mainframe :)

Have fun,
Ype

RE: Moving Index from Crawl/Build Server to Search Server [ In reply to ]

DCutting at grandcentral

Jan 31, 2002, 9:09 AM

Post #3 of 3 (770 views)

> From: Mark Tucker [mailto:MTucker@infoimage.com]
>
> What is the best way to
> move the index from the build server to the search servers
> and then change which index a user is searching against? I
> am concerned about switching the index while a user is paging
> through search results. Ideally new users will access the
> new index while current users will access the old index. How
> will I know when all current users are no longer accessing
> the old index so that it can be deleted?

If you're using Unix, this is easy. Once an IndexReader is created, it
keeps open all files it needs. Since Unix lets you delete files that are
open, there's no conflict. When the last search using the old IndexReader
completes and it is garbage collected, finalizers will close the files and
the OS will free the disk space.

The simplest approach on Unix is probably to use a symbolic link to the
latest version of the index. When a new index is installed, just update the
symbolic link to the new index and delete the old one: "ln -sfn new latest;
rm -rf old". (The -n keeps ln from following the existing link and creating
the new link inside the old index directory.)
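
If the installer is itself written in Java, the link update could be sketched as below. This assumes a Unix file system and a Java version with java.nio.file; the class LatestLink and the ".tmp" suffix are made up for the example. It creates a temporary link and renames it over the old one. With ATOMIC_MOVE, whether an existing target is replaced is implementation-specific; on Unix platforms the move is a rename, which does replace it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper that repoints the "latest" symbolic link. */
final class LatestLink {
    /** Makes link point at target without a window in which link is missing. */
    static void repoint(Path link, Path target) throws IOException {
        Path tmp = link.resolveSibling(link.getFileName() + ".tmp");
        Files.deleteIfExists(tmp);                 // leftover from a failed run
        Files.createSymbolicLink(tmp, target);     // new link under a temporary name
        // Moves the link itself, not the directory it points to.
        Files.move(tmp, link, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Readers that already opened the old index are unaffected; only later lookups of "latest" see the new directory.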

Use something like the following to cache the reader for the index in
"latest":
    // needs java.io.IOException and org.apache.lucene.index.IndexReader
    private IndexReader reader;
    private long readerLastModified;

    public synchronized IndexReader getIndexReader() throws IOException {
      long lastModified = IndexReader.lastModified("latest");
      if (reader == null || lastModified != readerLastModified) {
        // there's a new index: open it
        readerLastModified = lastModified;
        reader = IndexReader.open("latest");
      }
      return reader;
    }

If you call this for each search, then searches will always search the
latest index. It's most efficient to keep just one IndexReader per index for
searching.

On Win32 there are two complications. First, you cannot use a symbolic link
to refer to the latest version of the index. Nor can you rename or
overwrite the existing index, since it has open files. So you must do
something like list a directory, looking for a new index directory. Second,
you need to wait until all searches are complete before the old index can be
deleted. One option is to spawn a thread that keeps trying to delete it.
When the last search exits and the IndexReader is garbage collected, this
will succeed. A riskier approach is simply to wait five minutes, then call
IndexReader.close(), then delete it. Any searches in progress on the old
index will crash when it is closed, but the delete should succeed! If
anyone knows a way around these Win32 issues, I'd love to hear it.
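
A minimal sketch of those two Win32 workarounds, using only java.io, might look like the following. The class IndexDirs, the assumption that index directory names sort by age (e.g. index-20020131), and the retry interval are all illustrative choices.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

/** Hypothetical helpers for finding the newest index and reaping old ones. */
final class IndexDirs {
    /** Returns the subdirectory of root whose name sorts last, if any. */
    static Optional<File> newest(File root) {
        File[] dirs = root.listFiles(File::isDirectory);
        if (dirs == null) return Optional.empty();
        return Arrays.stream(dirs).max(Comparator.comparing(File::getName));
    }

    /** Deletes what it can; returns true once dir no longer exists. */
    static boolean tryDelete(File dir) {
        File[] children = dir.listFiles();
        if (children != null) {
            for (File child : children) tryDelete(child);
        }
        return dir.delete() || !dir.exists();
    }

    /** Starts a daemon thread that retries the delete until it succeeds. */
    static Thread deleteWhenFree(File dir, long retryMillis) {
        Thread reaper = new Thread(() -> {
            try {
                while (!tryDelete(dir)) Thread.sleep(retryMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "old-index-reaper");
        reaper.setDaemon(true);
        reaper.start();
        return reaper;
    }
}
```

To keep newest() from picking up a half-copied index, the copy could be made under a temporary name outside root and renamed into place once it is complete.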

Doug
