Mailing List Archive

Crash / Recovery Scenario
I'm currently using Lucene to sift through about a million documents. I've
written a servlet to do the indexing and the searching, and the servlets are run
through Resin. The crash scenario I'm thinking of is a web server crash (for a
million possible reasons) while the index is being updated or optimized. What
I've noticed is the creation of write.lock and commit.lock files witch stop
further indexing, because the application thinks that the previously scheduled
indexer is still running (witch could very well be true depending on the size of
the update). This is the recovery I have in mind, but I think it might be
somewhat of a hack: on restart of the web server I've written an Init function
that checks for write.lock or commit.lock, and if either exists it deletes both
of them and optimizes the index. Am I forgetting anything? Is this wrong? Is
there a Lucene-specific way of doing this, like running the optimizer with a
specific setup?
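
Something along these lines, roughly (a simplified sketch; the class name, index
path, and analyzer are illustrative, and it assumes the write.lock and
commit.lock files live in the index directory):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class IndexRecovery {

    // Illustrative path; in the servlet this would come from init parameters.
    private static final String INDEX_DIR = "/data/lucene/index";

    /** Called once on web server restart, before any indexing is scheduled. */
    public static void recoverIfCrashed() throws IOException {
        File writeLock  = new File(INDEX_DIR, "write.lock");
        File commitLock = new File(INDEX_DIR, "commit.lock");

        if (writeLock.exists() || commitLock.exists()) {
            // A previous indexing run died without releasing its locks:
            // remove both lock files so a new writer can be opened.
            writeLock.delete();
            commitLock.delete();

            // Re-open the existing index (create = false) and optimize it to
            // merge whatever segments the interrupted run left behind.
            IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), false);
            writer.optimize();
            writer.close();
        }
    }
}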

Nader S. Henein
Bayt.com , Dubai Internet City
Tel. +9714 3911900
Fax. +9714 3911915
GSM. +9715 05659557
www.bayt.com


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Crash / Recovery Scenario [ In reply to ]
Hi, I perform the same steps as you do, but I do that every time I get a
NullPointerException when I try to run a search. If this happens I try to reopen
the index searcher; if I get an exception there I sleep for 500 ms and try
again, and after 5 attempts I throw a servlet exception. Concerning the deletion
of write.lock and commit.lock, I use a manager: what it does is execute
different kinds of operations in blocks of, say, 100 or 1000.
Each operation can be:
1. Delete documents
2. Add documents
3. Search document(s)

A combination of these 3 operations allows me to "update" the index while
searches are still running. But there is a "versioning" problem between the
current cache of documents and the current version of the INDEXED documents:
during an update you can search for something that is found in the index but
that has already been updated in the cache, so I have a bunch of duplicate
documents during that window. At the end I notify all the clients connected to
that manager, via an RMI callback, to reopen the index, and then I clean up all
the duplicates. At this stage I still have an error case if the manager dies,
because I have everything in memory, but I did a little workaround to handle
that. My next step is to make this "transaction" persistent, so I can recover
the previous "status".

Every time I run an operation as listed above I check whether "write.lock" or
"commit.lock" exists; in that case I call the unlock() method, delete the files
(if the unlock method doesn't), and then optimize the index.
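
The retry-and-reopen part looks roughly like this (only a sketch; the class name
and index path are made up):

import java.io.IOException;
import javax.servlet.ServletException;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchHelper {

    private static final String INDEX_DIR = "/data/lucene/index"; // illustrative
    private IndexSearcher searcher;

    public Hits searchWithRetry(Query query) throws ServletException {
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                if (searcher == null) {
                    searcher = new IndexSearcher(INDEX_DIR);
                }
                return searcher.search(query);
            } catch (NullPointerException e) {
                // The searcher is pointing at segments an update replaced:
                // drop it and reopen against the current index.
                closeQuietly();
            } catch (IOException e) {
                // Reopening failed, perhaps because an update is in progress;
                // wait a little and try again.
                closeQuietly();
                try { Thread.sleep(500); } catch (InterruptedException ignored) {}
            }
        }
        throw new ServletException("could not reopen the index after 5 attempts");
    }

    private void closeQuietly() {
        try { if (searcher != null) searcher.close(); } catch (IOException ignored) {}
        searcher = null;
    }
}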

Until now everything seems to work fine.
ciao.

Re: Crash / Recovery Scenario [ In reply to ]
Nader,

I don't have a solution for you, but just removing these two files is
probably not a good idea. There is a reason for their existence.
Actually, check the jGuru Lucene FAQ for more information about them.

Otis
P.S.
s/witch/which/gi :)
witch = the ugly woman flying around on a broom stick :)

--- "Nader S. Henein" <nsh@bayt.net> wrote:
>
> I'm currently using Lucene to sift through about a million documents,
> I've
> written a servlet to do the indexing and the searching, the servlets
> are ran
> through resin, The Crash scenario I'm thinking of is a web server
> crash (
> for a million possible reasons ) while the index is being updated or
> optimized, what I've noticed is the creation of write.lock and
> commit.lock
> files witch stop further indexing because the application thinks that
> the
> previously scheduled indexer is still running (witch could very well
> be true
> depending on the size of the update). This is the recovery I have in
> mind
> but I think it might be somewhat of a hack, On restart of the web
> server
> I've written an Init function that checks for write.lock or
> commit.lock ,
> and if either exist it deletes both of them and optimizes the index.
> Am I
> forgetting anything ? is this wrong ? is there a Lucene specific way
> of
> doing this like running the optimizer with a specific setup.
>
> Nader S. Henein
> Bayt.com , Dubai Internet City
> Tel. +9714 3911900
> Fax. +9714 3911915
> GSM. +9715 05659557
> www.bayt.com
>
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
>


__________________________________________________
Do You Yahoo!?
Sign up for SBC Yahoo! Dial - First Month Free
http://sbc.yahoo.com

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: Crash / Recovery Scenario [ In reply to ]
If you tell me the computer doesn't crash, and the only issue is that you want
to stop the process safely, then in that case the Manager will not stop until
the task is complete, because I am running the Manager as an NT service. I do
have a little problem here: you cannot stop a thread while it is doing an I/O
operation like a recursive scan of a directory, so you have to wait a little
bit.
I see that you are looking for software stability, but software is strictly
tied to hardware; you need good hardware too. Think about a RAID setup (0 or 5,
it depends), and think about a clustered system.
It depends what you want from your search engine.
Also, I think it is good to focus on keeping the cache in a good state, e.g. if
I get a bad error and I can't recover the index, I rebuild it by calling a
method that scans my whole cache. It is not great, but better than nothing.
Also, I have never had that kind of problem.
Also, adopting a multi-threaded indexer improves the actual speed by about 40%;
you need to merge all the segments at the end (I tested with just 2 threads on
Win2K).
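
A sketch of that parallel-index-then-merge idea, assuming each worker thread
builds its own index directory and that the Lucene version in use has
IndexWriter.addIndexes(Directory[]) for the final merge (all paths, field names,
and document slices here are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ParallelIndexer {

    /** Each thread indexes its own slice of the documents into a private directory. */
    static Thread indexerThread(final String dir, final String[] texts) {
        return new Thread(new Runnable() {
            public void run() {
                try {
                    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
                    for (int i = 0; i < texts.length; i++) {
                        Document doc = new Document();
                        doc.add(Field.Text("contents", texts[i]));
                        writer.addDocument(doc);
                    }
                    writer.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        Thread t1 = indexerThread("/tmp/index-part1", new String[] { "first half ..." });
        Thread t2 = indexerThread("/tmp/index-part2", new String[] { "second half ..." });
        t1.start(); t2.start();
        t1.join();  t2.join();

        // Merge the per-thread segments into the final index.
        IndexWriter merged = new IndexWriter("/tmp/index-final", new StandardAnalyzer(), true);
        merged.addIndexes(new Directory[] {
            FSDirectory.getDirectory("/tmp/index-part1", false),
            FSDirectory.getDirectory("/tmp/index-part2", false) });
        merged.optimize();
        merged.close();
    }
}
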
If you are looking for a search engine like Google, there is a lot of work to
do, A LOT!!!!
My opinion is to split the index and the cache across 'n' machines, but the one
thing I don't know how to do is run a search over multiple indexes on multiple
machines. Sockets will not work; sockets become really slow with heavy traffic.
I was thinking of a Java-compatible DLL able to present multiple machines as
one logical unit.

ciao.


--

On Mon, 8 Jul 2002 21:07:32
Nader S. Henein wrote:
>brilliant .. I was thinking along the same lines, a new issue that I'm
>facing is just lucene dying on me, in the middle of indexing .. no server
>crash .. nothing .. what do you do if it just stops mid-indexing ?
>
RE: Crash / Recovery Scenario [ In reply to ]
I understand that these files are there for a reason, but in the case of a web
server crash Lucene will not be able to update/delete/optimize the index while
these files exist. The presence of these two files after a web server restart
means that the crash occurred while the web server was editing the index, and
since there is no way to roll back (is there? that would be a cool feature) I
have to cut my losses and continue.

Sorry for thinking out loud, but speaking of rollback, I asked a question a
while back about backing up the index while it's being written to:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg01711.html
Peter told me that it's no problem, especially on a Unix machine, because the
Lucene writer creates a new index and only deletes the old one while it's
working on the new one. So is there a way of checking for the .lock files in
the case of a crash and rolling back to the old index image?

Nader Henein

Re: Crash / Recovery Scenario [ In reply to ]
> only deletes the old one while it's working on the new one. So is there a
> way of checking for the .lock files in the case of a crash and rolling back
> to the old index image?
>
> Nader Henein

I have some thoughts about crash/recovery/rollback that I haven't found any
good solutions for.

If a crash happens during writing, there is no good way to know whether the
index is intact; removing the lock files doesn't change that fact, as we really
don't know. So providing rollback functionality is a good but expensive way
of compensating for the lack of recovery.

To provide rollback I have used a RAMDirectory and serialized it to a SQL
table. By doing this I can catch any exceptions and ask the database to roll
back if required. This works great for small indexes, but as the index grows
you will have problems with performance, because the whole RAMDirectory has to
be serialized/deserialized into the BLOB all the time.

A better solution would be to hack the FSDirectory to store each file it would
normally store in a file-system directory as a serialized byte array in a BLOB
of a SQL table. This would increase performance because the whole Directory
doesn't have to change each time, and it doesn't have to read the whole
directory into memory. I also suspect Lucene sorts its records into these
different files for increased performance (as in: I KNOW that record will be
in segment "xxx" if it is there at all).
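
As a cheap variant of that idea, here is a sketch that leaves Lucene's
Directory classes out of it and just snapshots each index file into its own
BLOB row over plain JDBC, inside one transaction so the database can roll the
whole snapshot back (the lucene_files table and its columns are made up; the
seek() problem described next still applies to a real Directory
implementation):

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class IndexFileBlobStore {

    /**
     * Copies every file of an on-disk Lucene index into a table such as
     *   CREATE TABLE lucene_files (name VARCHAR(64) PRIMARY KEY, data BLOB)
     * inside one transaction, so a failure rolls the whole snapshot back.
     */
    public static void storeIndex(Connection con, File indexDir) throws Exception {
        con.setAutoCommit(false);
        PreparedStatement ps =
            con.prepareStatement("INSERT INTO lucene_files (name, data) VALUES (?, ?)");
        try {
            File[] files = indexDir.listFiles();
            for (int i = 0; i < files.length; i++) {
                ps.setString(1, files[i].getName());
                ps.setBytes(2, readFile(files[i]));   // one index file = one BLOB row
                ps.executeUpdate();
            }
            con.commit();                             // the snapshot is atomic
        } catch (Exception e) {
            con.rollback();                           // keep the previous snapshot
            throw e;
        } finally {
            ps.close();
        }
    }

    private static byte[] readFile(File f) throws Exception {
        FileInputStream in = new FileInputStream(f);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        in.close();
        return out.toByteArray();
    }
}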

I have looked at the source for the RAMDirectory and the FSDirectory, and they
could both be altered to store their internal buffers in a BLOB, but I haven't
managed to do this successfully. The problem I have been pounding on is the
Lucene InputStream's seek() function. This really requires the underlying
implementation to be either a file or an array in memory. For a BLOB this would
mean that the blob has to be fetched, then read/seek-ed/written, then stored
back again. (Is this correct?!? And if so, is there a way to know WHEN it is
required to fetch/store the array?)

I would really appreciate any tips on this, as I think crash/recovery/rollback
functionality would benefit Lucene greatly.

I have indexes that take 5 days to build, and it's really bad to receive
exceptions during a long index run with no recovery/rollback functionality.

Mvh Karl Øie

RE: Crash / Recovery Scenario [ In reply to ]
I'm not worried about my hardware; I've been blessed with an 8-CPU Sun machine
and two 2-CPU Sun machines with gigs of memory, and I do run Lucene with 15
threads. I've set my merge factor at 1000 so a lot of the work is done in
memory (for speed). My current concerns are recovery related, as I'm a few days
from deployment. I'm not too familiar with the threading setup on Windows-based
machines; the beauty of Unix is that you can do anything. I'm worried about
Lucene hanging mid-indexing: how do I monitor that?

RE: Crash / Recovery Scenario [ In reply to ]
Karl, what if I copy the index in memory or to another directory prior to
indexing, thereby assuring a working index in the case of a crash? I want to
stay away from DB interaction as I am trying to move out of an Oracle
Intermedia search solution (if you saw the Oracle price list you would too).
I have a backup process which:
1) Checks if the index is being updated
2) Does a small trial search (to ensure that the index is not corrupt)
3) Tars the index and moves the file to another disk
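
In outline, the check looks something like this (only a sketch: it assumes the
lock files sit in the index directory, that a field named "contents" exists for
the trial search, and that tar is on the path; all paths are illustrative):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class IndexBackup {

    private static final String INDEX_DIR  = "/data/lucene/index";       // illustrative
    private static final String BACKUP_TAR = "/backup/lucene-index.tar"; // illustrative

    public static void backupIfSafe() throws Exception {
        // 1) Skip the backup if the index is being updated right now.
        if (new File(INDEX_DIR, "write.lock").exists()
                || new File(INDEX_DIR, "commit.lock").exists()) {
            return;
        }

        // 2) Small trial search: if the index is corrupt this should throw.
        IndexSearcher searcher = new IndexSearcher(INDEX_DIR);
        Hits hits = searcher.search(new TermQuery(new Term("contents", "test")));
        searcher.close();                     // the hit count itself doesn't matter

        // 3) Tar the index directory and ship the file off to another disk.
        Process tar = Runtime.getRuntime().exec(
                new String[] { "tar", "-cf", BACKUP_TAR, INDEX_DIR });
        if (tar.waitFor() != 0) {
            throw new IOException("tar exited with " + tar.exitValue());
        }
    }
}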

I'm thinking of writing a full backup/restore add-on to Lucene so all of
this can be jarred together as part of the package.

Nader

Re: Crash / Recovery Scenario [ In reply to ]
> to stay away from DB interaction as I am trying to move out of an Oracle
> Intermedia search solution (if you saw the Oracle price list you would
> too)

I have, and I am now using PostgreSQL and IBM DB2 exclusively :-)

> 1) Checks if the index is being updated
> 2) Does a small trial search (to ensure that the index is not corrupt)
> 3) Tars the index and moves the file to another disk
>
> I'm thinking of writing a full backup/restore add-on to Lucene so all of
> this can be jarred together as part of the package.
>
> Nader

I would really like to see your implementation of this, as I find this area to
be the only weak point in Lucene.

mvh karl øie

Re: Crash / Recovery Scenario [ In reply to ]
Karl Øie wrote:
> If a crash happens during writing, there is no good way to know whether the
> index is intact; removing the lock files doesn't change that fact, as we
> really don't know. So providing rollback functionality is a good but
> expensive way of compensating for the lack of recovery.

The index is intact. It is always intact. This has been discussed before.

Doug


Re: Crash / Recovery Scenario [ In reply to ]
Karl Øie wrote:
> A better solution would be to hack the FSDirectory to store each file it
> would normally store in a file-system directory as a serialized byte array
> in a BLOB of a SQL table. This would increase performance because the whole
> Directory doesn't have to change each time, and it doesn't have to read the
> whole directory into memory. I also suspect Lucene sorts its records into
> these different files for increased performance (as in: I KNOW that record
> will be in segment "xxx" if it is there at all).
>
> I have looked at the source for the RAMDirectory and the FSDirectory, and
> they could both be altered to store their internal buffers in a BLOB, but I
> haven't managed to do this successfully. The problem I have been pounding on
> is the Lucene InputStream's seek() function. This really requires the
> underlying implementation to be either a file or an array in memory. For a
> BLOB this would mean that the blob has to be fetched, then
> read/seek-ed/written, then stored back again. (Is this correct?!? And if so,
> is there a way to know WHEN it is required to fetch/store the array?)

A BLOB can be randomly accessed:

http://java.sun.com/j2se/1.4/docs/api/java/sql/Blob.html#getBytes(long,%20int)

A good driver should page BLOBs over the connection. A great driver
might even have a separate thread doing read-aheads. (Dream on.) It
looks like the leading JDBC driver for MySQL (mm) does not page blobs,
but rather always reads the entire blob. Sigh. On the bright side, the
JDBC driver for PostgreSQL does page BLOBS over the connection.

So it should be easy to implement a Lucene InputStream based on a BLOB.
The Directory should be a simple table of BLOBs.
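
For the read side, a bare-bones sketch of pulling a byte range out of such a
table of BLOBs with plain JDBC (the lucene_files table is hypothetical; a
Lucene InputStream built on top of it could delegate its buffered reads to
something like this):

import java.sql.Blob;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BlobFileReader {

    /**
     * Reads 'length' bytes starting at 'position' (0-based) from the file
     * stored in the hypothetical table lucene_files (name VARCHAR, data BLOB).
     * With a driver that pages BLOBs (e.g. PostgreSQL's), only the requested
     * range travels over the connection.
     */
    public static byte[] readChunk(Connection con, String fileName,
                                   long position, int length) throws Exception {
        PreparedStatement ps =
            con.prepareStatement("SELECT data FROM lucene_files WHERE name = ?");
        ps.setString(1, fileName);
        ResultSet rs = ps.executeQuery();
        try {
            if (!rs.next()) {
                throw new java.io.FileNotFoundException(fileName);
            }
            Blob blob = rs.getBlob(1);
            // java.sql.Blob positions are 1-based, hence the + 1.
            return blob.getBytes(position + 1, length);
        } finally {
            rs.close();
            ps.close();
        }
    }
}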

Lucene rarely seeks on writable streams. In other words, nearly all
files are written sequentially. With a quick scan, I can see only one
place where Lucene seeks an OutputStream: in TermInfosWriter it
overwrites the first four bytes once just before the file is closed.

So to implement a Lucene OutputStream you could cache the value of
Blob.setBinaryStream(int), and only create a new underlying output
stream when seek() is called.

Doug

