Mailing List Archive

exposing per-field storage usage
At Amazon, we have a need to produce regular metrics on how much disk
storage is consumed by each field. We manage an index with data
contributed by many teams and business units and we are often asked to
produce reports attributing index storage usage to these customers.
The best tool we have for this today is based on a custom Codec that
separates storage by field; to get the statistics, we read an existing
index and write it out with addIndexes and a force-merge under the
custom codec. This is time-consuming and inefficient, and it tends not
to get done.

I wonder if it would make sense to add methods to *some* API to
expose a per-field disk-space metric? If we don't want to add to
IndexReader, which would imply lots of intermediate methods and API
additions, maybe it could be computed by CheckIndex?

(Implementation note: for the current formats, the on-disk data is
always segregated by field, I think. In theory we might some day want
a data structure shared across fields, but that seems like an edge
case we could handle in some exceptional way.)
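Once per-field byte counts exist (however obtained), the attribution report itself is simple arithmetic. A minimal sketch, with made-up field names and sizes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldUsageReport {
    // Given bytes consumed per field, attribute storage as a percentage
    // of the total -- the kind of per-customer report described above.
    static Map<String, Double> percentages(Map<String, Long> bytesPerField) {
        long total = bytesPerField.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> pct = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : bytesPerField.entrySet()) {
            pct.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return pct;
    }

    public static void main(String[] args) {
        Map<String, Long> bytes = new LinkedHashMap<>();
        bytes.put("title", 1_000L);  // hypothetical field sizes
        bytes.put("body", 3_000L);
        System.out.println(percentages(bytes)); // {title=25.0, body=75.0}
    }
}
```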

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: exposing per-field storage usage
Hi Michael,

We developed similar functionality in Elasticsearch. The DiskUsage API
<https://github.com/elastic/elasticsearch/pull/74051> estimates the storage
of each field by iterating its structures (inverted index, doc values,
stored fields, etc.) and tracking the number of bytes read. The approach
is fast and the estimates are quite accurate.

I am +1 to the proposal.

Thanks,
Nhat

On Mon, Jun 13, 2022 at 1:22 PM Michael Sokolov <msokolov@gmail.com> wrote:

Re: exposing per-field storage usage
+1

Will really help with visibility.

On Tue, 14 Jun 2022, 00:56 Nhat Nguyen, <nhat.nguyen@elastic.co.invalid>
wrote:

Re: exposing per-field storage usage
On Mon, Jun 13, 2022 at 3:26 PM Nhat Nguyen
<nhat.nguyen@elastic.co.invalid> wrote:

I like an approach like this: enumerate the index, using something
like FilterDirectory to track the bytes read. It doesn't require you
to force-merge all the data through addIndexes, and at the same time
it doesn't invade the codec APIs.
The user can always force-merge the data themselves for situations
such as benchmarks or tracking space over time; otherwise the
fluctuations from merges could create too much noise.
Personally, I would suggest a separate API/tool rather than CheckIndex;
perhaps this tracking could mask bugs? No reason to mix the two concerns.
Also, the tool can be much more efficient than CheckIndex, e.g. for
stored fields and vectors it can just retrieve the first and last
documents, whereas CheckIndex must verify all of the documents,
slowly.
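The byte-counting trick can be sketched with a plain java.io stream standing in for Lucene's Directory/IndexInput wrapping; the class below is illustrative only, not the actual Lucene API:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Analogue of wrapping Directory.openInput: count every byte actually
// read, so that iterating one field's data structures yields that
// field's approximate on-disk footprint.
public class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    public CountingInputStream(InputStream in) { super(in); }

    @Override public int read() throws IOException {
        int b = in.read();
        if (b != -1) bytesRead++;
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) bytesRead += n;
        return n;
    }

    public long bytesRead() { return bytesRead; }

    public static void main(String[] args) throws IOException {
        try (CountingInputStream cin =
                 new CountingInputStream(new ByteArrayInputStream(new byte[1024]))) {
            cin.read(new byte[100], 0, 100); // bulk read: 100 bytes
            cin.read();                      // single-byte read: 1 byte
            System.out.println(cin.bytesRead()); // 101
        }
    }
}
```

A per-field tool would reset (or snapshot) the counter between fields while enumerating each field's postings, doc values, and so on.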

Re: exposing per-field storage usage
> Also, the tool can be much more efficient than CheckIndex, e.g. for
> stored fields and vectors it can just retrieve the first and last
> documents, whereas CheckIndex must verify all of the documents,
> slowly.


Yes, we implemented a similar heuristic in the DiskUsage API in
Elasticsearch.

On Mon, Jun 13, 2022 at 11:27 PM Robert Muir <rcmuir@gmail.com> wrote:

Re: exposing per-field storage usage
Oh yes, that's a clever idea. It seems it would take quite a while
(tens of minutes?) for a larger index, though? Much faster than the
force-merge solution, for sure. I guess to get faster we would have to
instrument each format. I mean, they generally do know how much space
each field occupies, but perhaps that's too much of an API change to
expose.

On Tue, Jun 14, 2022 at 12:09 AM Nhat Nguyen
<nhat.nguyen@elastic.co.invalid> wrote:

Re: exposing per-field storage usage
On Tue, Jun 14, 2022 at 10:37 AM Michael Sokolov <msokolov@gmail.com> wrote:
>
> Oh yes, that's a clever idea. It seems it would take quite a while
> (tens of minutes?) for a larger index, though? Much faster than the
> force-merge solution, for sure. I guess to get faster we would have to
> instrument each format. I mean, they generally do know how much space
> each field occupies, but perhaps that's too much of an API change to
> expose.

Why tens of minutes? That simple first-doc/last-doc trick works for the
term vectors and doc values too. For the postings, Terms.java has
methods getMin() and getMax(), so it is possible to seek to the first
and last term for the field.
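As a toy illustration of the seek-to-both-ends idea: if a field's data is laid out contiguously, the byte distance between its first and last entries bounds its footprint without decoding anything in between. The term dictionary below is modeled as sorted arrays with made-up offsets, not the real Terms API:

```java
import java.util.Arrays;

public class FieldSpanEstimate {
    // sortedTerms/offsets model a term dictionary; minTerm/maxTerm play
    // the role of Terms.getMin()/getMax(). Binary search stands in for
    // a TermsEnum seek: we jump to both ends instead of scanning.
    static long estimate(String[] sortedTerms, long[] offsets, long regionEnd,
                         String minTerm, String maxTerm) {
        int first = Arrays.binarySearch(sortedTerms, minTerm);
        int last = Arrays.binarySearch(sortedTerms, maxTerm);
        long end = (last + 1 < offsets.length) ? offsets[last + 1] : regionEnd;
        return end - offsets[first];
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "pear", "zebra"}; // hypothetical terms
        long[] offsets = {100L, 400L, 900L};         // their byte offsets
        System.out.println(estimate(terms, offsets, 1200L, "apple", "zebra")); // 1100
    }
}
```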

Re: exposing per-field storage usage
I didn't test with TB-scale indices, but the API took around 100-300 ms
to analyze a GB-scale index.

On Tue, Jun 14, 2022 at 11:15 AM Robert Muir <rcmuir@gmail.com> wrote:

Re: exposing per-field storage usage
OK, sorry, I must have misread the timings in the issue you forwarded!
Maybe I confused seconds with milliseconds.

On Tue, Jun 14, 2022 at 11:43 AM Nhat Nguyen
<nhat.nguyen@elastic.co.invalid> wrote:
