Mailing List Archive

find documents with big stored fields
Hello,

We are currently trying to investigate an issue where the index size is
disproportionately large for the number of documents. We see that the .fdt
file is more than 10 times its usual size.

Reading the docs, I found that this file contains the stored field data.

I would like to find the documents and/or field names/contents with extreme
sizes, so we can delete those from the index without needing to re-index
all data.

What would be the best approach for this?

Thanks,
Rob Audenaerde
Re: find documents with big stored fields
Hi Rob,

The codec records per docid how many bytes each document consumes -- maybe
instrument the codec's sources locally, then open your index and have it
visit stored fields for every doc in the index and gather stats?

Or, to avoid touching Lucene-level code, you could write a small tool that
loads the stored fields for each doc and gathers stats on the total string
length and stored-field count of all the fields in each doc?
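A rough sketch of such a tool (the class name, the index-path argument, and the
1 MB threshold are placeholders, and string length in chars is only a crude
proxy for bytes on disk):

import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.store.FSDirectory;

public class StoredFieldSizeStats {
  public static void main(String[] args) throws Exception {
    // args[0] = path to the index directory
    try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
      for (int docId = 0; docId < reader.maxDoc(); docId++) {
        // Loads all stored fields for this doc (deleted-but-unmerged docs included;
        // filter by live docs if you only care about current documents).
        Document doc = reader.document(docId);
        long approxBytes = 0;
        int fieldCount = 0;
        for (IndexableField f : doc.getFields()) {
          fieldCount++;
          if (f.stringValue() != null) {
            approxBytes += f.stringValue().length();   // char count as a byte proxy
          } else if (f.binaryValue() != null) {
            approxBytes += f.binaryValue().length;     // raw binary payload size
          }
        }
        if (approxBytes > 1_000_000) {                 // flag suspiciously large docs
          System.out.println("doc=" + docId + " storedFields=" + fieldCount
              + " approxBytes=" + approxBytes);
        }
      }
    }
  }
}

The same loop could aggregate per field name instead of per doc if the culprit
turns out to be one field that is stored everywhere.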

Mike McCandless

http://blog.mikemccandless.com


Re: find documents with big stored fields
Whoa.

First, it should be pretty easy to figure out which fields are large: just look at your input documents. The fdt files are really simple; they’re just the compressed raw data. Numeric fields, for instance, are stored as character data in the fdt files. We usually see about a 2:1 ratio of raw input to fdt size. There’s no need to look into Lucene. You’d have fun getting the info out of Lucene anyway, since stored fields are compressed on a document basis, not on a field basis. Oh, I guess you could probably get there through a bunch of low-level Lucene code, but it’d be far faster to just look at your input.
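To put rough numbers on that ratio (the figures here are made up): 1M docs averaging 10 KB of stored content each is about 10 GB of raw data, so at roughly 2:1 you’d expect an fdt in the neighborhood of 5 GB. If the fdt you see is many times larger than an estimate like that, some documents or fields are carrying far more stored data than you think.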

Second, look at your schema. Why are you storing certain fields? In particular, are you storing the _destination_ of any copyField? You don’t need to, nor should you.
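For illustration, a hypothetical schema.xml fragment (field and type names are invented) where the copyField destination is indexed but not stored, so its contents never land in the fdt files:

<field name="title"    type="text_general" indexed="true" stored="true"/>
<field name="catchall" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="catchall"/>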

Third, just changing stored=“true” to stored=“false” will _not_ change the index in any way until existing docs are re-indexed. When an existing document is re-indexed (or deleted), the doc is only marked as deleted in the segment it happens to be in. That data is not reclaimed until that segment is merged, which will happen sometime but not necessarily immediately.
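Once documents have been re-indexed (so their old copies are marked deleted), you can nudge that merging along instead of waiting, for example by posting an expungeDeletes commit to the update handler (the collection name and URL below are hypothetical):

<commit expungeDeletes="true"/>

sent to http://localhost:8983/solr/yourcollection/update with Content-Type: text/xml.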

Fourth, fdt files aren’t particularly germane to searching, just to retrieving the result list. It’s not good to have the index be unnecessarily large, but the presence of all that stored data is (probably) only minimally affecting search speed. When replicas go into full recovery, moving the extra data lengthens the process, and if you’re returning large result lists, reading and decompressing lots of data (assuming you’re not returning only docValues fields) is added work; but in the usual case of returning 10-20 results it’s usually not that big a deal. I’d still remove unnecessary stored fields, but wouldn’t consider it urgent. Just change the definition and continue as normal; things will get smaller over time.

So “bottom line”
- I claim you can look at your documents and know, with a high degree of accuracy, what’s contributing to your fdt file size.
- You should check your schema to see if you’re doing any copyFields where the destination has stored=“true” and change those.
- You’ll have to re-index your docs to see the data size shrink. Note that which segments get merged is opaque; don’t expect the index to shrink until you’ve re-indexed quite a number of docs. New segments should have much smaller fdt files relative to the sum of the other files in that segment.

Best,
Erick

Re: find documents with big stored fields
What version of Solr? In Solr 8.2 there will be a tool to facilitate this kind of analysis: see SOLR-13512. In the meantime, if you’re on Solr 8.x you should be able to back-port this change to your version fairly easily (7.x should be possible too, but with more changes).
