Hello,
I need some advice regarding incremental index updates.
There are three cases I need to handle when iterating over the
source files (the files that need to be indexed):
1. A file has not changed since the last update
2. A file has changed since the last update
3. A file has been removed since the last update
Case 1 is easy.
Case 2 as well: just remove the old document from the index and add
the new one.
Case 3 is bugging me.
How can I find out whether a file that is referenced in the index no
longer exists on disk?
The blunt solution would be to retrieve *all* file paths from the index
and check whether each one still exists. If it does, go on; if the file
does not exist on disk, remove it from the index. The problem I have
with this is that I am possibly pulling a lot of data from the Lucene
index. I will also do a lot of local filesystem checks. Sloooow?!
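
To make the blunt approach concrete, here is a minimal Python sketch. It
makes some assumptions: a plain dict keyed by file path stands in for the
Lucene index (this is not the Lucene API), and the names
`find_missing_paths` and `prune_missing` are hypothetical. It does show
where the cost comes from: one pass over every stored path and one
filesystem check per path.

```python
import os


def find_missing_paths(indexed_paths):
    """Return the indexed paths that no longer exist on disk."""
    # One filesystem check per indexed path: this is the expensive part.
    return [path for path in indexed_paths if not os.path.exists(path)]


def prune_missing(index):
    """Remove every entry whose file is gone and return the removed paths.

    `index` is a dict mapping file path -> document; it is only a
    stand-in for a real index (assumption for illustration).
    """
    missing = find_missing_paths(list(index))
    for path in missing:
        del index[path]
    return missing
```

The work grows with the total number of indexed files, not with the
number of files that actually changed.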
Another idea I had is to introduce an "index version" integer. This
number will be unique for each run of the parsing process, so each time
my indexer program is started, a new "index version" is created. Every
file that gets processed, including files that are already in the
index, will have the current "index version" number stored as a
document field.
This way all newly added, modified and unchanged documents will carry
an up-to-date "index version" flag after indexing is complete.
Now, to remove all physically deleted files from the index, I would
select all documents which have an old "index version" flag stored
inside them. Every document with such an old number can be safely
removed.
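
The "index version" idea can be sketched the same way. This is a minimal
Python illustration under some assumptions: a dict stands in for the
Lucene index, the modification time is assumed to be the change
indicator, and `run_update` is a hypothetical name.

```python
def run_update(index, files_on_disk, run_version):
    """Stamp every processed file with run_version, then sweep stale entries.

    index:         dict path -> {"mtime": ..., "version": ...} (stand-in index)
    files_on_disk: dict path -> mtime of the files found in this run
    run_version:   integer that is unique for this indexer run
    Returns the list of paths removed because their file is gone.
    """
    for path, mtime in files_on_disk.items():
        entry = index.get(path)
        if entry is None or entry["mtime"] != mtime:
            # New or modified file: (re)index it with the current version.
            index[path] = {"mtime": mtime, "version": run_version}
        else:
            # Unchanged file: only the version stamp is refreshed. This is
            # the per-document update that touches every document.
            entry["version"] = run_version

    # Anything still carrying an old version was not seen in this run.
    stale = [p for p, e in index.items() if e["version"] != run_version]
    for path in stale:
        del index[path]
    return stale
```

For example, after `run_update(index, {"a.txt": 1, "b.txt": 1}, 1)` and
then `run_update(index, {"a.txt": 2}, 2)`, the second call removes
`"b.txt"` and leaves only the re-indexed `"a.txt"` in `index`.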
The problem with this solution is that *every* document in the index
will get updated: first the old index version field is removed, then
the new field is added.
On the plus side, removing deleted files will be very fast.
What would you recommend for handling incremental updates?
I fear the first approach will be utterly slow for small updates,
whereas the second approach will be a lot faster at finding deleted
files, though adding stuff is slower because of the additional field
update for every document.
Thanks for your advice,
Johannes :-)