Mailing List Archive

Proposal to Reimplement Disk Usage API - Request for Feedback and Collaboration
Dear Community,

I am writing to share some thoughts on the existing Disk Usage API. I
believe there is an opportunity to improve its functionality and
performance through a reimplementation.
Currently, the best tool we have for this is based on a custom Codec that
separates storage by field; to get the statistics, we read an existing
index and write it out using addIndexes and force-merging with the custom
codec. This is time-consuming and inefficient, and it tends not to get done.
What we could do instead is similar to the functionality in Elasticsearch.
Its DiskUsage API <https://github.com/elastic/elasticsearch/pull/74051>
estimates the storage of each field by iterating its data structures
(e.g., inverted index, doc values, stored fields) and tracking the number
of bytes read. Since this enumerates the index directly, it would not
require force-merging all the data through addIndexes, and at the same
time it does not intrude on the codec APIs.
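To make the byte-tracking idea concrete, here is a minimal sketch using
plain java.io as a stand-in (the class name is hypothetical; in Lucene the
wrapper would sit around IndexInput rather than InputStream): each
per-field structure is visited through a counting wrapper, and the bytes
read are attributed to that structure.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of the accounting trick: a wrapper that
// counts every byte pulled through it. A Lucene implementation would
// wrap IndexInput the same way and keep one counter per field and
// per structure (postings, doc values, stored fields, ...).
class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != -1) bytesRead++; // count single-byte reads
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) bytesRead += n; // count bulk reads
        return n;
    }

    long getBytesRead() {
        return bytesRead;
    }
}
```

A real implementation would keep one such counter per (field, structure)
pair while enumerating the index, then report the totals without ever
rewriting the data.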

Thank you for your time and consideration. I would greatly appreciate any
input, suggestions, or concerns you might have regarding this proposal and
eagerly look forward to your response.

Best regards,
Re: Proposal to Reimplement Disk Usage API - Request for Feedback and Collaboration
Hi Deepika, that would be a welcome addition - we had an earlier
discussion about it; see the thread here:
https://markmail.org/message/hq7jvobsnxwp7iat

Please be careful not to copy the code from Elastic, as it is not
shared under an open license that permits copying.

On Wed, May 24, 2023 at 3:19 PM Deepika Sharma
<deeps.sharma0510@gmail.com> wrote:
>
> Dear Community
>
> I am writing to share thoughts on the existing Disk Usage API, I believe
> there is an opportunity to improve its functionality and performance
> through a reimplementation.
> Currently, the best tool we have for this is based on a custom Codec that
> separates storage by field; to get the statistics we read an existing index
> and write it out using AddIndexes and force-merging, using the custom
> codec. This is time-consuming and inefficient and tends not to get done.
> What we could do is similar to the functionality in Elasticsearch. The
> DiskUsage API <https://github.com/elastic/elasticsearch/pull/74051>
> estimates the storage of each field by iterating its structures (i.e.,
> inverted index, doc-values, stored fields, etc.) and tracking the number of
> read-bytes. Since we will enumerate the index, it wouldn't require us to
> force-merge all the data through addIndexes, and at the same time it
> doesn't invade the codec apis.
>
> Thank you for your time and consideration. I would greatly appreciate any
> input, suggestions, or concerns you might have regarding this proposal and
> eagerly look forward to your response.
>
> Best regards,

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org