Mailing List Archive

Need advice for doing incremental Index updates
Hello,



I need some advice regarding incremental index updates.



There are three cases I need to handle when iterating over the
source files (the files that need to be indexed):

1. A file did not change since the last update
2. A file did change since the last update
3. A file was removed since the last update



Case 1 is easy.

Case 2 as well: just remove the old document and add the new one.

Case 3 is bugging me.



How can I find out whether a file that is listed in the index no longer
exists?



The blunt solution would be to retrieve *all* file paths from the index
and check whether each one exists. If it does, go on; if the file does
not exist on disk, remove it from the index. The problem I have with
this is that I am possibly pulling a lot of data from the Lucene index.
I will also do a lot of local filesystem checks. Sloooow?!
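
A minimal sketch of this blunt approach, written in Python for brevity. The `index` dict is a hypothetical stand-in for the real index (file path mapped to document); it is not the Lucene API, and `prune_missing` is a made-up name used only for illustration.

```python
import os


def prune_missing(index):
    """Remove entries whose source file no longer exists on disk.

    `index` is a hypothetical in-memory stand-in for the real index:
    a dict mapping file path -> document. Returns the removed paths.
    """
    # Collect the missing paths first so the dict is not modified
    # while it is being iterated.
    missing = [path for path in index if not os.path.exists(path)]
    for path in missing:
        del index[path]
    return missing
```

The cost is one existence check per indexed path on every run, regardless of how few files actually changed, which is the concern raised above.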



Another idea I had is to introduce an "index version" integer. This
number will be unique for each run of the parsing process, so each
time my indexer program is started a new "index version" is created.
Each file which exists in the index and gets processed will then have
the "index version" number stored as a document field.

This way all newly added and modified documents will have an up-to-date
"index version" flag after indexing is complete.

Now, to remove all physically deleted files from the index, I would
select all documents which have an old "index version" flag stored
inside them. Every document with such an old number can be safely
removed.

The problem with this solution is that *every* document in the index
will get updated: first the old index version field is removed, then the
new field is added.

On the plus side, removing deleted files will be very fast.
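
The "index version" idea can be sketched as follows. This is an illustration only: the index is modelled as a plain Python dict rather than a Lucene index, and the names `run_indexer`, `files_on_disk`, and `index_version` are assumptions made for the example.

```python
def run_indexer(index, files_on_disk, run_version):
    """Stamp every processed file with run_version, then drop stale entries.

    index:         dict path -> {"content": ..., "index_version": int}
                   (hypothetical stand-in for the real index)
    files_on_disk: dict path -> content, i.e. the files found in this run
    Returns the list of paths removed because their file has disappeared.
    """
    # Every file seen in this run is (re)written with the new version,
    # which is the per-document update cost described above.
    for path, content in files_on_disk.items():
        index[path] = {"content": content, "index_version": run_version}

    # Anything still carrying an older version was not seen in this run.
    stale = [p for p, doc in index.items()
             if doc["index_version"] != run_version]
    for path in stale:
        del index[path]
    return stale
```

In this model the final cleanup is a single pass over the version field, while the stamping loop touches every document that still has a file on disk.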





What would you recommend for keeping the index incrementally up to date?

I fear the first version will be utterly slow for small updates, whereas
the second version will be a lot faster, though adding documents is
slower because of the additional field update for every document.



Thanks for your advice,

Johannes :-)
Re: Need advice for doing incremental Index updates
I would solve your problem external to the index ... every time you run
your incremental process, as you walk your directory tree of files (adding
the new ones, deleting/re-adding the modified ones), record every file and
save that somewhere. When you are all done, compare the list from this
run with the list from the last run -- any file in the old list and not in
the new list is a document to be deleted.
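
A rough sketch of this external file list, assuming the list is kept as a plain text file with one path per line and stored outside the indexed directory tree. All function names and the file format are assumptions made for illustration; nothing here touches the index itself.

```python
import os


def walk_files(root):
    """Return the set of all file paths found under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(dirpath, name))
    return found


def load_manifest(manifest_path):
    """Read the list saved by the previous run (empty on the first run)."""
    if not os.path.exists(manifest_path):
        return set()
    with open(manifest_path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def save_manifest(manifest_path, paths):
    """Write one path per line; assumes paths contain no newlines."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for path in sorted(paths):
            f.write(path + "\n")


def deleted_since_last_run(root, manifest_path):
    """Return paths that were in the previous list but are gone now."""
    previous = load_manifest(manifest_path)
    current = walk_files(root)
    save_manifest(manifest_path, current)
    return previous - current
```

The caller would then delete the returned paths from the index. The comparison is a set difference, so no per-document query against the index is needed to find removed files.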


: Date: Tue, 8 Aug 2006 15:48:16 +0200
: From: "Leimbach, Johannes" <JLeimbach@CONET.DE>
: Reply-To: general@lucene.apache.org
: To: general@lucene.apache.org
: Subject: Need advice for doing incremental Index updates

-Hoss
Re: Need advice for doing incremental Index updates
Hi,
If I run the incremental process and walk my directory tree of files, does
it cost more time?
I ask because I would have to run a thread to do what you describe, and it
would run all the time.
Thanks,
john




----- Original Message -----
From: "Chris Hostetter" <hossman_lucene@fucit.org>
To: <general@lucene.apache.org>
Sent: Wednesday, August 09, 2006 5:32 AM
Subject: Re: Need advice for doing incremental Index updates


Re: Need advice for doing incremental Index updates
Good morning Chris,

Thank you for your answer.

I have thought about using an external file table to solve my problem, but I don't like this idea very much either.

The problem is that the Lucene index might very easily get corrupted and out of sync. Imagine the external file table gets lost, or writing to it is aborted while the index is still being updated. It seems you can very easily end up with inconsistencies, can't you?

Though this will probably be the way I'll go. Touching everything in the index might be truly atomic, but it is too slow.
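
One common way to reduce the risk of a half-written file table is to write it to a temporary file and rename it into place only once it is complete. The sketch below assumes a one-path-per-line text file; `save_manifest_atomically` is a hypothetical helper, not part of any library. It does not make the index and the file table update together atomically, it only avoids leaving a truncated table behind after an aborted write.

```python
import os
import tempfile


def save_manifest_atomically(manifest_path, paths):
    """Replace manifest_path with a complete new file, or leave it untouched."""
    directory = os.path.dirname(os.path.abspath(manifest_path))
    # The temp file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".manifest-",
                                    suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for path in sorted(paths):
                f.write(path + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # os.replace overwrites the old file in a single rename step.
        os.replace(tmp_path, manifest_path)
    except BaseException:
        # On any failure, remove the partial temp file and keep the old list.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the file table is lost entirely, a full check of every indexed path against the disk would still be needed to get back in sync.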

To John:
I don't understand your question, can you post it again?

Bye,
Johannes
