Mailing List Archive

_write_postings hanging in _02
I added some entries to an index, deleted them all, then added them again,
then deleted them again. On the second time through, on the delete, when I
call finish on the writer, _write_postings ends up hanging.

Mac OS X 10.4, perl 5.8.4.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_write_postings hanging in _02 [ In reply to ]
On Mar 7, 2007, at 5:33 PM, Chris Nandor wrote:

> I added some entries to an index, deleted them all, then added them
> again,
> then deleted them again. On the second time through, on the
> delete, when I
> call finish on the writer, _write_postings ends up hanging.

Thanks for the report. I've reproduced and identified the problem,
and am working on a fix.

Your narrative brings up a related issue, recently uncovered. As
currently implemented, delete_by_term() only operates on documents
which were already in the index before the indexing session started.
So if you do this...

$invindexer->add_doc( { content => 'foo' } );
$invindexer->delete_by_term( content => 'foo' );
$invindexer->finish;

... the doc added just before the call to delete_by_term() won't be
deleted. That's because docs added via add_doc exist in limbo until
finish() is called; it's not possible to search against them until
the segment is completed.

Changing things so that that doc gets deleted would be hard -- it
would be necessary to cache each term, then go back and delete docs
later, but only ones added before that particular call to
delete_by_term(). The implementation would have to be elaborate and
therefore both fragile and brittle. Either that or you'd have to
write out the segment before each call to delete_by_term, which is a
non-starter -- performance would nosedive.

My inclination is to document the method's actual behavior, but I
don't think that's enough -- the name delete_by_term suggests a
certain behavior (I'm thinking SQL DELETE with a WHERE clause) and
it's bad design to have it do something subtly different. Perhaps
renaming the method to something more descriptive, like
"delete_existing" would help.

I haven't decided what to do yet. Thoughts?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_write_postings hanging in _02 [ In reply to ]
On Mar 7, 2007, at 5:33 PM, Chris Nandor wrote:

> I added some entries to an index, deleted them all, then added them
> again,
> then deleted them again. On the second time through, on the
> delete, when I
> call finish on the writer, _write_postings ends up hanging.

I got a bus error from trying to dereference a NULL pointer. This
appears to be the fix, and was committed as r2130.


slothbear:~/projects/ks/perl marvin$ svn diff ../c_src/KinoSearch/
Index/PostingsWriter.c
Index: ../c_src/KinoSearch/Index/PostingsWriter.c
===================================================================
--- ../c_src/KinoSearch/Index/PostingsWriter.c (revision 2113)
+++ ../c_src/KinoSearch/Index/PostingsWriter.c (working copy)
@@ -227,6 +227,9 @@
if ( iter == -1 ) { /* never true; can only get here via
goto */
/* prepare to clear out buffers and exit loop */
FINAL_ITER: {
+ /* skip final iter if there are no terms */
+ if (outstream == NULL)
+ break;
iter = -1;
REFCOUNT_DEC(term_text);
term_text = (ViewByteBuf*)BB_new(0);
slothbear:~/projects/ks/perl marvin$


The test code I used to reproduce the bug is below. Please let me
know whether the flow is the same as yours, since I didn't get a hang.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#--------------------------------------------------------------------

my $invindex = KinoSearch::InvIndex->create(
folder => KinoSearch::Store::RAMFolder->new,
schema => TestSchema->new,
);
my $invindexer = KinoSearch::InvIndexer->new( invindex => $invindex );

$invindexer->add_doc( { content => $_ } ) for 'a' .. 'c';
$invindexer->finish;
$invindexer = KinoSearch::InvIndexer->new( invindex => $invindex );
$invindexer->delete_by_term( content => $_ ) for 'a' .. 'c';
$invindexer->finish;
$invindexer = KinoSearch::InvIndexer->new( invindex => $invindex );
$invindexer->add_doc( { content => $_ } ) for 'a' .. 'c';
$invindexer->finish;
$invindexer = KinoSearch::InvIndexer->new( invindex => $invindex );
$invindexer->delete_by_term( content => $_ ) for 'a' .. 'c';
$invindexer->finish;
_write_postings hanging in _02 [ In reply to ]
At 19:35 -0800 2007.03.07, Marvin Humphrey wrote:
>Thanks for the report. I've reproduced and identified the problem,
>and am working on a fix.

Great, thanks.


>Your narrative brings up a related issue, recently uncovered. As
>currently implemented, delete_by_term() only operates on documents
>which were already in the index before the indexing session started.

...

>My inclination is to document the method's actual behavior, but I
>don't think that's enough -- the name delete_by_term suggests a
>certain behavior (I'm thinking SQL DELETE with a WHERE clause) and
>it's bad design to have it do something subtly different. Perhaps
>renaming the method to something more descriptive, like
>"delete_existing" would help.
>
>I haven't decided what to do yet. Thoughts?

Well, for now, you can just document it. :-) I can't think of any easy
way to do it offhand, and there are more important issues to deal with.

I am not sure if this can actually come up with our code. It probably can,
so I need to take a closer look.

The way our code works is that there's a queue of changes to index, so I
can run through "new" index items first and then look at "modified" and
"deleted" items, and skip those if the same ID is in the "new" list, to be
picked up at the next run. Just a few lines of code, probably.


Oh, one more thing: are you planning (*cough*) on a dump_index replacement?
I know the schema is required which complicates things a bit.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_write_postings hanging in _02 [ In reply to ]
At 20:08 -0800 2007.03.07, Marvin Humphrey wrote:
>I got a bus error from trying to dereference a NULL pointer. This
>appears to be the fix, and was committed as r2130.

>The test code I used to reproduce the bug is below. Please let me
>know whether the flow is the same as yours, since I didn't get a hang.

The flow is basically the same, yes. And I no longer get the hang. Yay!

I was getting a bus error earlier, though; although I am not sure how much
was pilot error.

I was updating my searcher code, and previously I had been setting the
offset passed to seek() using $hits->total_hits. But now I can't get that
before I call seek(), and as a result, I was passing num_wanted => 0 to
search(). This bug in my code causing a bus error in KS.

That said, I wonder if 0 or something similar might be a way to denote
"send everything." My workaround now is to send $reader->num_docs instead,
which is fine too, I think.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_write_postings hanging in _02 [ In reply to ]
On Mar 7, 2007, at 8:08 PM, Chris Nandor wrote:

> Oh, one more thing: are you planning (*cough*) on a dump_index
> replacement?
> I know the schema is required which complicates things a bit.

Hmm. I'd forgotten about it; it's still in the distro, moved to the
devel folder, but no longer compatible.

It would be easy to adapt to the present API, but you'd have to pass
the class name of the schema as a command line arg or something like
that:

dump_index MySchema /path/to/invindex

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_write_postings hanging in _02 [ In reply to ]
At 20:49 -0800 2007.03.07, Marvin Humphrey wrote:
>On Mar 7, 2007, at 8:08 PM, Chris Nandor wrote:
>
>> Oh, one more thing: are you planning (*cough*) on a dump_index
>> replacement?
>> I know the schema is required which complicates things a bit.
>
>Hmm. I'd forgotten about it; it's still in the distro, moved to the
>devel folder, but no longer compatible.
>
>It would be easy to adapt to the present API, but you'd have to pass
>the class name of the schema as a command line arg or something like
>that:
>
> dump_index MySchema /path/to/invindex

Oops, I should have been more specific. I got that far, but some of the
other code isn't working, like get_finfos.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_write_postings hanging in _02 [ In reply to ]
On Mar 7, 2007, at 8:36 PM, Chris Nandor wrote:

> I was updating my searcher code, and previously I had been setting the
> offset passed to seek() using $hits->total_hits.

I don't understand what the use is of this, unless it's to naively
retrieve the last (worst) matches.

FYI, in KS 0.15 and earlier, calling total_hits before seek()
actually triggers a call to seek(0, 100) internally. It's not
possible to know how many documents a query matches without running
the whole scoring routine.

Note that there's not much difference between calling seek(0, 10) and
seek(0, 100). The only change is the size of a priority queue; the
cost of matching and scoring remains the same.

KS 0.15 also performed unnecessary seeks in some cases -- for
instance, calling seek(0, 10) when you've already called seek(0, 100)
shouldn't be necessary, but KS was doing that if you called seek(0,
10) after total_hits(). This has changed in 0.20. Credit to Henry
for identifying the issue.

> But now I can't get that
> before I call seek(),

While you were "able" to get it before, you still had doubled costs.

> and as a result, I was passing num_wanted => 0 to
> search(). This bug in my code causing a bus error in KS.

Heh. I'll go fix that.

> That said, I wonder if 0 or something similar might be a way to denote
> "send everything."

There are memory and performance implications for setting a large
num_wanted. Hits are collected in a priority queue, and the size of
the queue is determined by num_wanted.

> My workaround now is to send $reader->num_docs instead,
> which is fine too, I think.

That will work -- sort of. If your index is large, that's gonna be a
huge priority queue. Each element in the queue is either a ScoreDoc
(16 bytes) or, when sorting, a FieldDoc (20 bytes presently, and
probably about to grow to take in an arbitrary string).

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_write_postings hanging in _02 [ In reply to ]
At 21:26 -0800 2007.03.07, Marvin Humphrey wrote:
>I don't understand what the use is of this, unless it's to naively
>retrieve the last (worst) matches.

To get all the matches. :-)

In this case, actually, it was a workaround for not having sorting in KS.
I am storing only the IDs in KS, so I would get the IDs of all the matches,
and pass them back to MySQL. Hopefully I won't need this workaround soon;
tomorrow I start playing with sorting.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_write_postings hanging in _02 [ In reply to ]
On Mar 7, 2007, at 8:55 PM, Chris Nandor wrote:

>> It would be easy to adapt to the present API, but you'd have to pass
>> the class name of the schema as a command line arg or something like
>> that:
>>
>> dump_index MySchema /path/to/invindex
>
> Oops, I should have been more specific. I got that far, but some
> of the
> other code isn't working, like get_finfos.

Done.

Repository revision 2133.

Also required an update to to_string in c_src/KinoSearch/Index/Term.c.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_write_postings hanging in _02 [ In reply to ]
On Mar 7, 2007, at 8:36 PM, Chris Nandor wrote:

> I was updating my searcher code, and previously I had been setting the
> offset passed to seek() using $hits->total_hits. But now I can't
> get that
> before I call seek(), and as a result, I was passing num_wanted =>
> 0 to
> search(). This bug in my code causing a bus error in KS.

Fixed. num_wanted => 0 no longer crashes.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/