Mailing List Archive

Incorrect CollectionStatistics if IndexWriter.close is not called
Hi,

I don't understand if I'm doing something wrong or if it is the
expected behaviour.

My problem is that when a document is updated, collectionStatistics
returns counts as if a new document had been added to the index, even after
a call to IndexWriter.commit and to
SearcherManager.maybeRefreshBlocking.
If I call IndexWriter.close, the counts are correct again, but the
documentation of IndexWriter.close says to try to reuse the
IndexWriter, so I'm a bit confused.

Example:
If I add two documents to an empty index,

IndexSearcher.collectionStatistics("TEXT") returns
"field="TEXT",maxDoc=2,docCount=2,sumTotalTermFreq=5,sumDocFreq=5" ->
OK

Then I update one of the documents and call commit():

IndexSearcher.collectionStatistics("TEXT") returns
"field="TEXT",maxDoc=3,docCount=3,sumTotalTermFreq=9,sumDocFreq=9" ->
NOK

If I call close() now,

IndexSearcher.collectionStatistics("TEXT") returns
"field="TEXT",maxDoc=2,docCount=2,sumTotalTermFreq=6,sumDocFreq=6" ->
OK

Note that the counts are correct if the index contains only one document.


I attached a test case.

Am I doing something wrong somewhere?


Julien
Re: Incorrect CollectionStatistics if IndexWriter.close is not called
I *guess* it's because an update is implemented as a delete of the old document followed by an insert of the new one. Deletes in Lucene are lazy: the deleted document is just flagged as deleted in a bitmap, and it is removed from the index only when segments are merged. Did you check the IndexSearcher.collectionStatistics documentation? It should mention something about that.
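
To make the lazy-delete idea concrete, here is a toy sketch in plain Java. It is not Lucene code: the class LazyDeleteSketch and all of its methods are made up for illustration, and a real segment is far more involved. It only shows why a count that includes flagged documents goes from 2 to 3 after an update, and back to 2 once a merge rewrites the segment.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of one segment with lazy deletes. Hypothetical names, NOT Lucene's implementation.
public class LazyDeleteSketch {
    // Every stored document, including the ones flagged as deleted.
    private final List<String> docs = new ArrayList<>();
    // One bit per document; a set bit means "flagged as deleted".
    private final BitSet deleted = new BitSet();

    public void add(String doc) {
        docs.add(doc);
    }

    // An update flags the old copy as deleted and appends the new copy.
    public void update(int oldDocId, String newDoc) {
        deleted.set(oldDocId);
        docs.add(newDoc);
    }

    // Counts every stored document, flagged or not.
    public int maxDoc() {
        return docs.size();
    }

    // Counts only the documents that are not flagged.
    public int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    // A merge rewrites the segment and drops the flagged documents.
    public LazyDeleteSketch merge() {
        LazyDeleteSketch merged = new LazyDeleteSketch();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                merged.add(docs.get(i));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        LazyDeleteSketch segment = new LazyDeleteSketch();
        segment.add("first document");
        segment.add("second document");
        System.out.println(segment.maxDoc() + " / " + segment.numDocs()); // prints "2 / 2"

        segment.update(0, "first document, updated");
        System.out.println(segment.maxDoc() + " / " + segment.numDocs()); // prints "3 / 2"

        LazyDeleteSketch merged = segment.merge();
        System.out.println(merged.maxDoc() + " / " + merged.numDocs()); // prints "2 / 2"
    }
}
```

In this toy model, statistics computed over all stored documents behave like the maxDoc=3 case above until the merge happens.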

Cheers,
diego


From: java-user@lucene.apache.org At: 02/28/21 11:09:52 To: java-user@lucene.apache.org
Subject: Incorrect CollectionStatistics if IndexWriter.close is not called

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Incorrect CollectionStatistics if IndexWriter.close is not called
Hi,

You're right, the documentation of Terms.getDocCount says that "this
measure does not take deleted documents into account".
So if we want correct counts and correct query scores, the IndexWriter
has to be closed after documents are deleted/updated, and a new one has
to be created when new documents arrive.

Thanks

On Sun, Feb 28, 2021 at 17:04, Diego Ceccarelli (BLOOMBERG/ LONDON)
<dceccarelli4@bloomberg.net> wrote:
Re: Incorrect CollectionStatistics if IndexWriter.close is not called
I'm not sure that closing and reopening the index writer will always work. I think the 'problem' will be solved once the segment with the deleted document is merged with another segment. That might happen during the close, but it might also *not* happen (e.g., if you have only one segment and you delete, closing/opening probably won't fix it).

Can you describe the problem you are trying to solve? Why do you need such accuracy? If this is for some type of scoring, the ranking shouldn't be affected if you have X or X-1 documents in the collection...

Cheers,
diego

From: java-user@lucene.apache.org At: 03/01/21 16:23:48 To: Diego Ceccarelli (BLOOMBERG/ LONDON), java-user@lucene.apache.org
Subject: Re: Incorrect CollectionStatistics if IndexWriter.close is not called

Re: Incorrect CollectionStatistics if IndexWriter.close is not called
I'm indexing medical documents, and documents on emerging topics are
updated quite often.
For example, right now, "COVID" will be overrepresented in my index
because deleted documents are still counted, and so a "COVID" query
will have a lower score than a query on an "unfashionable" topic,
because the idf also takes into account the "number of documents
containing term".
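
As a rough illustration of the size of the effect, here is a small sketch that assumes a BM25-style idf, ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). The class name IdfSketch and all the numbers are hypothetical: 1000 live documents, 100 of them containing the term, each of those updated three times without the old copies being merged away.

```java
// Worked idf example with made-up numbers; an assumption-laden sketch, not a measurement.
public class IdfSketch {
    // BM25-style idf: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Statistics that count only live documents.
        double clean = idf(100, 1000);
        // Each of the 100 matching documents was updated three times, so 300
        // flagged-as-deleted copies are still counted in both statistics.
        double inflated = idf(100 + 300, 1000 + 300);

        System.out.println("idf without deleted copies: " + clean);
        System.out.println("idf with deleted copies:    " + inflated);
    }
}
```

With these assumed numbers the second value comes out clearly smaller than the first, which is the penalty on the frequently updated topic.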

I did not expect this behaviour, but I can understand that it's needed
for performance reasons. The only thing I can think of to get
accurate scoring is to reindex my documents more often, and I don't
always have that luxury.

Thanks for your replies :)

On Mon, Mar 1, 2021 at 20:47, Diego Ceccarelli (BLOOMBERG/ LONDON)
<dceccarelli4@bloomberg.net> wrote:
Re: Incorrect CollectionStatistics if IndexWriter.close is not called
Marc, you don't need to reindex to have fewer deletes and less impact from
this. Merging will get rid of the deletes.

If updates are coming in batches, you could consider calling
IndexWriter.forceMergeDeletes after updating a batch to keep things
tidy.

Otherwise, if updates are coming in continuously at all hours, it gets
trickier, but you can still adjust things such as merge policy parameters
so that deletes are more aggressively merged away in a continuous fashion
(at the cost of increased merging, of course): look at settings such as
setDeletesPctAllowed on TieredMergePolicy.
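
To give a feel for what a "deletes percent allowed" style threshold measures, here is a toy calculation in plain Java. It does not call Lucene and is not how TieredMergePolicy decides anything internally; the class DeletesPctSketch, its methods, the segment sizes and the threshold of 20 are all assumptions picked for illustration.

```java
// Toy check of the share of deleted documents across segments. Hypothetical, NOT Lucene's merge logic.
public class DeletesPctSketch {
    // Percentage of stored documents, summed over all segments, that are flagged as deleted.
    public static double deletedPct(int[] maxDocs, int[] delCounts) {
        long total = 0;
        long deleted = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            total += maxDocs[i];
            deleted += delCounts[i];
        }
        return total == 0 ? 0.0 : 100.0 * deleted / total;
    }

    // True when the share of deleted documents is above the tolerated percentage.
    public static boolean exceeds(double pct, double allowedPct) {
        return pct > allowedPct;
    }

    public static void main(String[] args) {
        int[] maxDocs = {1000, 500, 100};  // documents stored per segment (assumed)
        int[] delCounts = {300, 50, 0};    // flagged-as-deleted documents per segment (assumed)

        double pct = deletedPct(maxDocs, delCounts);
        System.out.println("deleted share: " + pct + "%"); // prints "deleted share: 21.875%"
        System.out.println("above 20% threshold: " + exceeds(pct, 20.0)); // prints "above 20% threshold: true"
    }
}
```

A lower tolerated percentage means the statistics stay closer to the live documents, paid for with more merge work.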

On Tue, Mar 2, 2021 at 4:35 AM Marc F <xfrontlinex@gmail.com> wrote:
