Mailing List Archive

Re: Indexing and Duplication
Kelvin,

>Ype,
>
>That would be a good solution to my problem if only I weren't performing
>multi-threaded indexing. :(
>The Reader obtained by any one thread may not be an accurate reflection of
>the actual state of the index, just what the state was when the Reader was
>instantiated.

Why share index readers between threads?
For searching this is fine, of course, but importing can be done
differently.

You might consider changing the functionality of your threads a bit:
one or more threads for indexing and one or
more other threads for extracting the lucene documents.

You could eg. use a bounded queue of batches of lucene docs as input to
the indexing threads. The extracting thread(s) can then put
lucene docs in the next batch and put the batch on the queue.

The only exclusive serial part would then be opening the index reader,
deleting a batch of old docs, and closing the reader. Adding a batch of
new docs can be done by eg. two threads while not using the reader.

For incremental imports an index reader is also needed to check whether a
document has been imported or not. Such checks might be done up front
during a single run of the import program.
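
For example, the reader-only part of handling one batch might look roughly
like the sketch below. This is untested; the "uid" field name is just an
example, and if your Lucene version has no delete(Term) on the index reader
you can walk a TermDocs enumeration and call delete(int) instead.

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Untested sketch: the exclusive, reader-only phase for one batch.
// Assumes every document carries a unique "uid" keyword field.
public class ReaderPhase {

    /** Deletes any previously indexed versions of the docs in this batch. */
    public static void deleteOldVersions(String indexPath, List batch)
            throws IOException {
        IndexReader reader = IndexReader.open(indexPath);
        try {
            for (Iterator it = batch.iterator(); it.hasNext();) {
                Document doc = (Document) it.next();
                Term idTerm = new Term("uid", doc.get("uid"));
                if (reader.docFreq(idTerm) > 0) {   // already imported?
                    reader.delete(idTerm);          // delete the old version(s)
                }
            }
        } finally {
            reader.close();                         // give up the reader quickly
        }
    }
}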

In this way the index readers are used for rather short periods
to do some batch of work, and there is no need to share them
between threads.
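
To make the hand-off between the threads concrete, a minimal bounded queue of
batches could be as simple as the following (untested sketch using plain
wait/notify; class and method names are invented):

import java.util.ArrayList;
import java.util.List;

// Untested sketch: a bounded queue of document batches for the
// extractor/indexer split described above.
public class BatchQueue {
    private final List queue = new ArrayList();
    private final int capacity;

    public BatchQueue(int capacity) {
        this.capacity = capacity;
    }

    /** Called by extracting threads; blocks while the queue is full. */
    public synchronized void put(List batch) throws InterruptedException {
        while (queue.size() >= capacity) {
            wait();
        }
        queue.add(batch);
        notifyAll();
    }

    /** Called by indexing threads; blocks while the queue is empty. */
    public synchronized List take() throws InterruptedException {
        while (queue.isEmpty()) {
            wait();
        }
        List batch = (List) queue.remove(0);
        notifyAll();
        return batch;
    }
}

An extracting thread fills a batch and calls put(); an indexing thread calls
take(), does the reader-only deletes for that batch, and then adds the new
docs with the writer, so neither the reader nor the writer is held for long.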

>My current solution is that I hold a collection of documents with the key as
>my object identifier and only write them to the writer after indexing is

What's the difference between 'writing to the writer' and 'indexing'?

>done. I chose it because it saved me having to write, then delete a
>document, etc. However, it's not so ideal because the memory consumed by
>such an approach may be prohibitive.

>What do you think?

Memory usage can be limited by using a bounded queue. A single batch
of docs on the queue can be limited by eg. the total size of the docs.
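
For instance, a batch could track an estimated byte count and report when it
is full; the size estimate is whatever the extracting code happens to know,
e.g. the length of the source file (untested sketch):

import java.util.ArrayList;
import java.util.List;

// Untested sketch: a batch that is considered full once the estimated
// total size of its pending docs exceeds a threshold.
public class Batch {
    private final List docs = new ArrayList();   // Lucene Documents in practice
    private final long maxBytes;
    private long totalBytes = 0;

    public Batch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Returns false once the batch is full and should be put on the queue. */
    public boolean add(Object doc, long estimatedBytes) {
        docs.add(doc);
        totalBytes += estimatedBytes;
        return totalBytes < maxBytes;
    }

    public List getDocs() {
        return docs;
    }
}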

I assumed you need to delete old docs while adding new ones. In case
you don't need to delete old docs, you might not need an
index reader at all.

Ype


>Regards,
>Kelvin
>----- Original Message -----
>From: "Ype Kingma" <ykingma@xs4all.nl>
>To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>Sent: Sunday, March 17, 2002 6:15 AM
>Subject: Re: Indexing and Duplication
>
>
> > Kelvin,
> >
> > >I've got a little problem with indexing that I'd like to throw to
> > >everyone.
> > >
> > >My objects have a unique identifier. When indexing, before I create a new
> > >document, I'd like to check if a document has already been created with
> > >this identifier. If so, I'd like to retrieve the document corresponding to
> > >this identifier, and add the fields I currently have to this document's
> > >fields and write it. If no such document exists, then I'd create a new
> > >document, add my fields and write it. What this really does, I guess, is
> > >ensure that a document object represents a body of information which
> > >really belongs together, eliminating duplication.
> > >
> > >With the current API, writing and retrieving is performed by the
> > >IndexWriter and IndexReader respectively. This effectively means that in
> > >order to do the above, I'd have to close the writer, create a new instance
> > >of the index reader after each document has been added in order for the
> > >reader to have the most updated version of the index (!).
> > >
> > >Does anyone have any suggestions how I might approach this?
> >
> > Avoid closing and opening too much by batching n docs at a time
> > on the index reader, and then do the things needed for those n docs on the
> > index writer. You might have to delete docs on the reader, too.
> >
> > The reasons for using the reader for reading/searching/deleting
> > and using the writer for adding have been discussed some time ago on this
> > list. I can't provide a pointer into the list archives as I don't recall
> > the original subject header, sorry.
> >
> > Regards,
> > Ype
> >


Re: Indexing and Duplication [ In reply to ]
Ype,

----- Original Message -----
From: "Ype Kingma" <ykingma@xs4all.nl>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Tuesday, March 19, 2002 3:57 AM
Subject: Re: Indexing and Duplication


> Kelvin,
>
> >Ype,
> >
> >That would be a good solution to my problem if only I weren't performing
> >multi-threaded indexing. :(
> >The Reader obtained by any one thread may not be an accurate reflection of
> >the actual state of the index, just what the state was when the Reader was
> >instantiated.
>
> Why share index readers between threads?
> For searching this is fine, of course, but importing can be done
> differently.

Even if each thread has its own reader, after it has obtained the reader,
another thread may have written to the index...

>
> You might consider changing the functionality of your threads a bit:
> one or more threads for indexing and one or
> more other threads for extracting the lucene documents.
>
> You could eg. use a bounded queue of batches of lucene docs as input to
> the indexing threads. The extracting thread(s) can then put
> lucene docs in the next batch and put the batch on the queue.
>
> The only exclusive serial part would then be opening the index reader,
> deleting a batch of old docs, and closing the reader. Adding a batch of
> new docs can be done by eg. two threads while not using the reader.
>
> For incremental imports an index reader is also needed to check whether a
> document has been imported or not. Such checks might be done up front
> during a single run of the import program.
>
> In this way the index readers are used for rather short periods
> to do some batch of work, and there is no need to share them
> between threads.

Hmmm... interesting. It's a good suggestion, and I'll need to think a bit
more about it.

>
> >My current solution is that I hold a collection of documents with the key as
> >my object identifier and only write them to the writer after indexing is
>
> What's the difference between 'writing to the writer' and 'indexing'?
>

Sorry. I should've been more explicit. Indexing means the creation of
Document objects and adding fields to them, not adding them to the writer
yet.
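
In code, that 'indexing' step is just something like this (field names are
only examples):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// "Indexing" in the sense above: build the Document in memory only;
// handing it to the IndexWriter happens in a later, separate step.
public class BuildDoc {
    public static Document build(String uid, String body) {
        Document doc = new Document();
        doc.add(Field.Keyword("uid", uid));     // unique identifier, not tokenized
        doc.add(Field.Text("contents", body));  // tokenized body text
        return doc;
        // later, elsewhere: writer.addDocument(doc);
    }
}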

> >done. I chose it because it saved me having to write, then delete a
> >document, etc. However, it's not so ideal because the memory consumed by
> >such an approach may be prohibitive.
>
> >What do you think?
>
> Memory usage can be limited by using a bounded queue. A single batch
> of docs on the queue can be limited by eg. the total size of the docs.
>
> I assumed you need to delete old docs while adding new ones. In case
> you don't need to delete old docs, you might not need an
> index reader at all.

I know. My approach wasn't using batches at all. Each indexing thread
was just adding documents to a hashtable. The main thread would then iterate
through the hashtable and add them to the writer.

This seems like a silly question, but will keeping hold of Document objects
cause me to run into "Too many files open" problems? If each document object
has a Field.Text which contains a Reader, and the Reader isn't closed till
the document is indexed, would this be an issue? Is the memory consumed by
Document objects directly proportional to the size of the object the Reader
reads?

Thanks.

Regards,
Kelvin

Re: Indexing and Duplication [ In reply to ]
Kelvin,
<snip>

>
>This seems like a silly question, but will keeping hold of Document objects
>cause me to run into "Too many files open" problems? If each document object

No, not unless you fail to close the files (if any) that you read the doc fields from.
It depends on how you obtain your document fields.

>has a Field.Text which contains a Reader, and the Reader isn't closed till
>the document is indexed, would this be an issue? Is the memory consumed by

I have not used Readers yet, so I don't know.

>Document objects directly proportional to the size of the object the Reader
>reads?

I think/hope the point of using a Reader is to avoid reading the whole document
into some buffer, so the add() method of the index writer only needs to
tokenize the stream from the Reader.
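
I'd expect it to be used roughly like the sketch below, though I haven't
tried it myself; the file and field names are placeholders.

import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Untested sketch: the field wraps the Reader, and the text is only pulled
// through the analyzer when addDocument() is called, not when the field is built.
public class StreamedField {
    public static void addFile(IndexWriter writer, String path)
            throws IOException {
        FileReader contents = new FileReader(path);    // this does take a file handle
        try {
            Document doc = new Document();
            doc.add(Field.Text("contents", contents)); // tokenized, not stored
            writer.addDocument(doc);                   // the stream is consumed here
        } finally {
            contents.close();
        }
    }
}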

As for memory usage during indexing:
I have indexed docs with around 100,000 terms in a single String
passed to Field(), and with the maximum number of terms per field set to ten million.
The JVM starts taking more memory occasionally, but I have not seen it
use more than 17 MB yet (-verbose option to java).

I'd suggest reconsidering the use of a Hashtable to communicate
between threads. I know a Hashtable is thread safe, but some form of queue
is a more natural fit there. Also, with a bounded queue
a limit on memory usage is easily enforced, because the feeding thread
will wait as long as needed. For more about queues:
http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html
The FAQ entry there about producer and consumer threads convinced me
to use bounded queues after I got some out-of-memory crashes...
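
From memory, the hand-off with that package looks something like the sketch
below; the class names (Channel, BoundedBuffer) are as I recall them, so
please check them against the docs behind the link.

import EDU.oswego.cs.dl.util.concurrent.BoundedBuffer;
import EDU.oswego.cs.dl.util.concurrent.Channel;

// Untested sketch of the producer/consumer hand-off with a bounded channel.
public class QueueWiring {
    public static void main(String[] args) {
        final Channel queue = new BoundedBuffer(10);  // at most 10 batches in flight

        // Extractor thread: put() blocks automatically once the buffer is full.
        new Thread(new Runnable() {
            public void run() {
                try {
                    queue.put("a batch of lucene docs");  // placeholder payload
                } catch (InterruptedException ignored) {
                }
            }
        }).start();

        // Indexer thread: take() blocks while the buffer is empty.
        new Thread(new Runnable() {
            public void run() {
                try {
                    Object batch = queue.take();
                    // ... delete old versions, then add the new docs ...
                } catch (InterruptedException ignored) {
                }
            }
        }).start();
    }
}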

Have fun,
Ype

Re: Indexing and Duplication [ In reply to ]
>
>
>This seems like a silly question, but will keeping hold of Document objects
>cause me to run into "Too many files open" problems? If each document object
>has a Field.Text which contains a Reader, and the Reader isn't closed till
>the document is indexed, would this be an issue? Is the memory consumed by
>Document objects directly proportional to the size of the object the Reader
>reads?
>
>Thanks.
>
>Regards,
>Kelvin
>
Yes, I think if the Reader is a regular FileReader that has been opened,
it would consume the file handle. On the other hand, if it is just a
String or a StringReader, it would consume memory equal to (or probably
greater than) the size of the data. One way to fix this is to create your
own Reader class, say DelayedReader, which does not open a file upon
creation, but only upon the first read. That would help save the file
handles.
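
Roughly something like this (untested sketch, minimal error handling):

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

// Untested sketch: a Reader that remembers the file name but only opens the
// underlying FileReader (and thus takes a file handle) on the first read.
public class DelayedReader extends Reader {
    private final String fileName;
    private Reader in;                      // null until the first read

    public DelayedReader(String fileName) {
        this.fileName = fileName;
    }

    private Reader in() throws IOException {
        if (in == null) {
            in = new FileReader(fileName);  // handle consumed only now
        }
        return in;
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return in().read(cbuf, off, len);
    }

    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}

Passing new DelayedReader(path) to Field.Text would then postpone the
FileReader, and the file handle, until the IndexWriter actually tokenizes
the field.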

Dmitry


Re: Indexing and Duplication [ In reply to ]
> >
> Yes, I think if the Reader is a regular FileReader that has been opened,
> it would consume the file handle. On the other hand, if it is just a
> String or a StringReader, it would consume memory equal to (or probably
> greater than) the size of the data. One way to fix this is to create your
> own Reader class, say DelayedReader, which does not open a file upon
> creation, but only upon the first read. That would help save the file
> handles.

That's an excellent suggestion actually, Dmitry. Thanks!

>
> Dmitry


Re: Indexing and Duplication [ In reply to ]
<snip>

> I'd suggest reconsidering the use of a Hashtable to communicate
> between threads. I know a Hashtable is thread safe, but some form of queue
> is a more natural fit there. Also, with a bounded queue
> a limit on memory usage is easily enforced, because the feeding thread
> will wait as long as needed. For more about queues:
> http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html
> The FAQ entry there about producer and consumer threads convinced me
> to use bounded queues after I got some out-of-memory crashes...

Thanks Ype, I'll definitely take your advice. Thanks for the link too...

Regards,
Kelvin

>
> Have fun,
> Ype

