Mailing List Archive

out of memory
Hello,

When indexing with 0.20_02, I'm finding that I'm running out of memory
before the indexing process ends. I didn't run into this in 0.15. Is
there something that I might not be doing correctly? I can buy more
RAM, but I'd rather not right now. When the indexing dies, the space
consumed by the files is around 4 GB.

Thanks
out of memory [ In reply to ]
On Mar 28, 2007, at 7:11 AM, Roger Dooley wrote:

> When indexing with 0.20_02,

What's your actual config? I know you're using the new Tokenizer,
but that's not in 0.20_02. Did you copy just Tokenizer.pm into
0.20_02, or did you check out from subversion?

> I'm finding that I'm running out of memory before the indexing
> process ends. I didn't run into this in 0.15.

You shouldn't be running into it with 0.20_xx either. The underlying
model hasn't changed.

When documents are added via InvIndexer, the verbatim stored fields
are written right away. The inverted data - the "postings" -- are
serialized and dumped into an external sort pool. This sort pool
flushes every 16 MB or so. The end result is that most indexing
processes hit a hard knee somewhere in the neighborhood of 30 MB of
RAM usage, allowing for differences in document size. There's a
negligible creep upwards after that.
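
If you want to watch for that plateau from inside your own indexing
loop, a minimal Linux-only helper like the sketch below (it just reads
/proc/self/status; print it every thousand add_doc calls or so) will
show where memory levels off:

    # Minimal sketch: report this process's resident memory in kB.
    # Linux-only, since it reads /proc/self/status.
    sub rss_kb {
        open my $fh, '<', '/proc/self/status' or return -1;
        while ( my $line = <$fh> ) {
            return $1 if $line =~ /^VmRSS:\s+(\d+)/;
        }
        return -1;
    }

    # For example, inside the indexing loop:
    #   print "RSS: ", rss_kb(), " kB after $count docs\n" unless $count % 1000;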

> I can buy more RAM, but I'd rather not right now.

I doubt that would help. This is probably a bug.

However, in our last thread, we discussed a hack for setting the
memory threshold for the external sort pool. Are you still doing
that? If so, what are you setting it to?

> When the indexing dies, the space consumed by the files is around 4
> GB.

If I can duplicate this problem, I can solve it. Probably the first
angle should be to run the benchmarker with --docs=10000000000 and
see if memory plateaus as it does with 0.15. I'll go give that a whirl.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
out of memory [ In reply to ]
Marvin Humphrey (3/28/2007 12:03 PM) wrote:
>
> On Mar 28, 2007, at 7:11 AM, Roger Dooley wrote:
>
>> When indexing with 0.20_02,
>
> What's your actual config? I know you're using the new Tokenizer, but
> that's not in 0.20_02. Did you copy just Tokenizer.pm into 0.20_02, or
> did you check out from subversion?
>

New Tokenizer from the previous week, but the rest is from 0.20_02.

>> I'm finding that I'm running out of memory before the indexing process
>> ends. I didn't run into this in 0.15.
>
> You shouldn't be running into it with 0.20_xx either. The underlying
> model hasn't changed.
>
> When documents are added via InvIndexer, the verbatim stored fields are
> written right away. The inverted data - the "postings" -- are
> serialized and dumped into an external sort pool. This sort pool
> flushes every 16 MB or so. The end result is that most indexing
> processes hit a hard knee somewhere in the neighborhood of 30 MB of RAM
> usage, allowing for differences in document size. There's a negligible
> creep upwards after that.
>
>> I can buy more RAM, but I'd rather not right now.
>
> I doubt that would help. This is probably a bug.
>
> However, in our last thread, we discussed a hack for setting the memory
> threshold for the external sort pool. Are you still doing that? If so,
> what are you setting it to?
>

I've commented that part out for this round of indexing. I can try
setting this again and see what happens. Anything else I can try to
figure out what is going on?

Thanks

>> When the indexing dies, the space consumed by the files is around 4 GB.
>
> If I can duplicate this problem, I can solve it. Probably the first
> angle should be to run the benchmarker with --docs=10000000000 and see
> if memory plateaus as it does with 0.15. I'll go give that a whirl.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
out of memory [ In reply to ]
On Mar 28, 2007, at 9:23 AM, Roger Dooley wrote:

> Marvin Humphrey (3/28/2007 12:03 PM) wrote:
>> On Mar 28, 2007, at 7:11 AM, Roger Dooley wrote:
>>> When indexing with 0.20_02,
>> What's your actual config? I know you're using the new Tokenizer,
>> but that's not in 0.20_02. Did you copy just Tokenizer.pm into
>> 0.20_02, or did you check out from subversion?
>
> New Tokenizer from the previous week, but the rest is from 0.20_02.

OK. Unfortunately, I can't duplicate this issue using either that
config or subversion trunk. In both cases, memory usage plateaus at
33.8 MB on my box for the benchmarking script.

> I've commented that part out for this round of indexing. I can try
> setting this again and see what happens.

It would be better to leave it at its default. My only concern was
the remote possibility that it was set to a value that was causing
the problem.

> Anything else I can try to figure out what is going on?

Troubleshooting memory leaks isn't easy. Here's what I would do,
which is not the same as what I recommend you do:

1) Move the problem script to a Linux system if it's
not there already.
2) Compile a debugging perl from 5.8.8 sources.
3) Run devel/valgrind_test.plx using the debugging perl.
4) Examine the output for memory leaks.

If none show up, then the problem is script specific.

5) Run the script under valgrind and debug perl. The
environment variable PERL_DESTRUCT_LEVEL has to be
set to 2 and the suppressions file devel/p588_valgrind.supp
has to be fed to valgrind. (Peek at the commands that
valgrind_test.plx runs; a rough sketch of an equivalent
invocation follows this list.)
6) If nothing turns up after indexing a few documents
and exiting cleanly, invoke the script under valgrind
and debug perl again, but let it run for a long time
and then crash it intentionally so that Perl doesn't
run its cleanup routines. Then examine valgrind's
output looking for clues as to where the memory went.
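
Roughly what step 5 boils down to, as a sketch; the paths and script
name are placeholders, and the valgrind flags shown are the stock
leak-check options rather than necessarily the exact ones
valgrind_test.plx uses:

    #!/usr/bin/env perl
    # Sketch of step 5: run the problem script under valgrind with a
    # debugging perl (one built from 5.8.8 sources with -DDEBUGGING).
    # Paths and the script name are placeholders.
    use strict;
    use warnings;

    $ENV{PERL_DESTRUCT_LEVEL} = 2;    # make perl free everything at exit
    my @cmd = (
        'valgrind',
        '--leak-check=full',
        '--suppressions=devel/p588_valgrind.supp',
        '/usr/local/debugperl/bin/perl',    # the debugging perl
        'problem_script.pl',
    );
    system(@cmd) == 0 or warn "valgrind run exited with status $?\n";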

Hopefully at that point we'd be able to narrow down the search to
KS's perl code (not likely), KS's C code (likely), or your script
itself (quite possible -- could be a black hole hash, for example).

What I recommend you do is attempt to duplicate the problem so that I
can hunt it down. Create a script that I can run while monitoring its
memory usage with top. Use the us_constitution HTML presentation if
you can. If the footprint keeps growing long past 30 MB, send it my way.
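
Here is a hedged sketch of the sort of reproduction script meant above,
written against the 0.15-style InvIndexer API (the 0.20_xx dev releases
may differ); the corpus path, field name, and crude tag stripping are
placeholders. It loops over the us_constitution files repeatedly so
there is time to watch the process in top:

    #!/usr/bin/env perl
    # Reproduction sketch: index the us_constitution HTML files over and
    # over so the process lives long enough to watch in top.
    # Assumes the 0.15-style InvIndexer API; adjust for your release.
    use strict;
    use warnings;
    use KinoSearch::InvIndexer;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $corpus_dir = 'sample/us_constitution';    # placeholder location
    my $analyzer   = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/tmp/uscon_invindex',        # placeholder path
        create   => 1,
        analyzer => $analyzer,
    );
    $invindexer->spec_field( name => 'content' );

    my @files = glob("$corpus_dir/*.html");

    for my $pass ( 1 .. 500 ) {
        for my $file (@files) {
            open my $fh, '<', $file or die "Can't open $file: $!";
            my $html = do { local $/; <$fh> };
            ( my $text = $html ) =~ s/<[^>]*>/ /g;    # crude tag strip
            my $doc = $invindexer->new_doc;
            $doc->set_value( content => $text );
            $invindexer->add_doc($doc);
        }
        warn "finished pass $pass\n";
    }
    $invindexer->finish;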

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
out of memory [ In reply to ]
On Mar 28, 2007, at 10:52 AM, Marvin Humphrey wrote:

> the benchmarking script.

FYI, the benchmarking script that shipped with 0.20_02 is b0rken.
It's been fixed in svn trunk, but there are some leaks and errors in
trunk right now.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
out of memory [ In reply to ]
On 3/28/07, Roger Dooley <dooley@wolfram.com> wrote:

> When indexing with 0.20_02, I'm finding that I'm running out of memory
> before the indexing process ends. (...) When the indexing dies, the space
> consumed by the files is around 4 GB.

Dear Roger,

Did you find a solution to your problem? I tried full-text indexing
approximately 1,000,000 documents and ran into a similar situation when
the total file size reached about 4-5 GB (using 0.20_02 and the newer
Tokenizer file). However, as far as I recall, memory was OK, but
indexing slowed almost to a stop.

In general, what would be an ideal indexing strategy: build a single
large index, or build many smaller ones and merge them? I will try the
second option tomorrow and report the results.
out of memory [ In reply to ]
Karel K. (4/3/2007 6:17 PM) wrote:
> On 3/28/07, Roger Dooley <dooley@wolfram.com> wrote:
>
>> When indexing with 0.20_02, I'm finding that I'm running out of memory
>> before the indexing process ends. (...) When the indexing dies, the
>> space
>> consumed by the files is around 4 GB.
>
> Dear Roger,
>
> Did you find a solution to your problem? I tried full-text indexing
> approximately 1,000,000 documents and ran into a similar situation when
> the total file size reached about 4-5 GB (using 0.20_02 and the newer
> Tokenizer file). However, as far as I recall, memory was OK, but
> indexing slowed almost to a stop.
>
> In general, what would be an ideal indexing strategy: build a single
> large index, or build many smaller ones and merge them? I will try the
> second option tomorrow and report the results.

I was able to index the documents incrementally. I would index two or
three directories at a time, adding each new batch of documents to the
index created by the earlier runs. This seems to have worked, but I
need to test it some more.
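
Something along these lines, as a sketch (the driver script name and
directory layout are placeholders); because each batch runs in its own
process, whatever memory a run accumulates is released before the next
batch starts:

    #!/usr/bin/env perl
    # Batch driver sketch: hand the indexing script a few directories at a
    # time, one process per batch.  "index_batch.plx" and the directory
    # layout are placeholders.
    use strict;
    use warnings;

    my @dirs       = glob('/data/docs/*');
    my $batch_size = 3;

    while ( my @batch = splice( @dirs, 0, $batch_size ) ) {
        # Each run opens the existing invindex and appends the new batch.
        system( 'perl', 'index_batch.plx', @batch ) == 0
            or die "Batch @batch failed: $?\n";
    }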

-Roger
out of memory [ In reply to ]
On Apr 3, 2007, at 3:17 PM, Karel K. wrote:

> Did you find a solution to your problem? I tried full-text indexing
> approximately 1,000,000 documents and ran into a similar situation when
> the total file size reached about 4-5 GB (using 0.20_02 and the newer
> Tokenizer file). However, as far as I recall, memory was OK, but
> indexing slowed almost to a stop.

FWIW, I'm always concerned about such reports. I just made a
reasonably diligent attempt to duplicate Roger's problem here and
failed, so I can't proceed. There are many things for me to work on,
so I have to rely on people supplying me with failing test cases for
certain kinds of bugs. If I get such a case for this bug, I'll
restart my debugging efforts.

> In general, what would be an ideal indexing strategy - build a single
> large index - build many smaller ones and merge them?

Either strategy should be fine. A single index will probably take a
fraction less total time.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/