Mailing List Archive

I'm getting fewer than expected results when supplying multiple fields
Hi there,

I'm using the devel version (0.20_05). My index contains 74330
documents. I have created a field in that index which is set to the
same value for all records so that I can easily retrieve every
document (useful whilst testing), i.e. search.pl q="all:1". Calling
$hits->total_hits on that search gives me '74330', which is what I
expect.

Performing a search on another field, fieldx:foo, gives me 4481 hits.
I have confirmed that this quantity is correct for this field.

When I do the following : search.pl q="all:1 AND fieldx:foo", I get a
lower quantity of 4449 hits. I've lost 32 documents.

I've diffed the output of the two searches involving fieldx to
identify the missing documents and manually reviewed their contents.
As far as I can see, they *should* be included, as the missing
documents do contain 'all:1' and also contain 'fieldx:foo'.

In MySchema, the 'all' and 'fieldx' fields are indexed and stored.
They aren't being analyzed.

I'm constructing my query using QueryParser, with the default_boolop
set to 'AND'.

#-----
my $qp = KinoSearch::QueryParser->new(
    schema         => MySchema->new,
    default_boolop => 'AND',
);
$qp->set_heed_colons(1);

my $reader = KinoSearch::Index::IndexReader->open(
    invindex => MySchema->read( './index' ),
);

my $searcher = KinoSearch::Searcher->new( reader => $reader );

my $hits = $searcher->search(
    query      => $qp->parse( $q ),
    offset     => 0,
    num_wanted => 100000,
);

warn $hits->total_hits;
#-----

I have also noticed my results vary depending upon the order of the
fields in my query. Searching for 'fieldx:foo AND all:1' yields 4453
hits. I don't expect this to happen as the 'all:1' is true for every
document in the index.

Why would the 3 searches not yield the same results? As I understand
it, the queries are effectively the same - although KinoSearch doesn't
treat them so.

Many thanks,

Adam

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: I'm getting fewer than expected results when supplying multiple fields
Hello Adam,

Thanks for the detailed report.

> I'm using the devel version (0.20_05).

Was this index originally built under 0.20_04, and does it have
deletions? That's one known bug, leading to index corruption.

Also, how many segments are in the index? (You can tell at a glance
by counting files with a ".cf" extension within the index directory.)

> Performing a search on another field, fieldx:foo, gives me 4481 hits.
> I have confirmed that this quantity is correct for this field.
>
> When I do the following : search.pl q="all:1 AND fieldx:foo", I get a
> lower quantity of 4449 hits. I've lost 32 documents.

This behavior suggests a bug in either ANDScorer or one of the
PostingList subclasses.

Think of a PostingList as an array of document numbers associated
with a particular term. You have two PostingLists (inside
TermScorers), and it's ANDScorer's job to take the intersection.
Conceptually, it's simple enough. However, PostingList cannot be
implemented as an array because that wouldn't scale -- under the
hood, it's an iterator reading compressed records off of disk.

One possibility is that PostingList is reading records incorrectly,
so that the iterated doc nums don't match what ought to be in that
array. That was what happened with the old deletions bug: the stream
got out of sync because the data was garbage, and if KS didn't
segfault outright, the results were incorrect.

The second possibility is that PostingList is fine, but ANDScorer is
performing the intersection improperly.
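
As a rough illustration of the intersection described above, here is a
minimal sketch in C. It assumes each posting list is already a plain
ascending array of doc numbers held in memory, which is exactly the
simplification that does not scale in practice. intersect_sorted is a
hypothetical name and not a KinoSearch function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of KinoSearch: intersect two ascending
 * arrays of doc numbers. Matches are written to out, which must have
 * room for min(a_len, b_len) entries. Returns the number of matches. */
size_t
intersect_sorted(const uint32_t *a, size_t a_len,
                 const uint32_t *b, size_t b_len,
                 uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;

    assert(out != NULL);

    while (i < a_len && j < b_len) {
        if (a[i] < b[j]) {
            i++;                /* a is behind, advance it */
        }
        else if (b[j] < a[i]) {
            j++;                /* b is behind, advance it */
        }
        else {
            out[n++] = a[i];    /* same doc number in both lists */
            i++;
            j++;
        }
    }
    return n;
}
```

If one list contains every document, as with all:1, the result should
simply equal the other list, whichever of the two is passed first.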

> Why would the 3 searches not yield the same results?

They should.

There are two stages of compilation for that particular query string:
QueryParser produces a BooleanQuery, and BooleanQuery produces a
BooleanScorer wrapping an ANDScorer. ANDScorer operates on an array
of subscorers (in this case there would be two TermScorers in the
array), and the order in which the subscorers are arranged matters in
terms of how the intersection algorithm plays out.

My intuition is that, if it's not the deletions issue, then
ANDScorer_skip_to is to blame. The algo, which is very similar to
that used by PhraseScorer, is only mildly convoluted, but it happens
to be hard to write tests for.

If you can supply a failing test case, I will work with that
directly. Otherwise, I'll attempt to improve testing for ANDScorer
and hope that the bug shows itself.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: I'm getting fewer than expected results when supplying multiple fields
On (09/11/07 20:51), Marvin Humphrey wrote:
> Hello Adam,
>
> Thanks for the detailed report.
>
> > I'm using the devel version (0.20_05).
>
> Was this index originally built under 0.20_04, and does it have
> deletions? That's one known bug, leading to index corruption.

No, I created the index using 0.20_05 and it doesn't contain any
deletions.

> Also, how many segments are in the index? (You can tell at a glance
> by counting files with a ".cf" extension within the index directory.)

There is one segment file in the index, which is about 69MB in size.

> > Performing a search on another field, fieldx:foo, gives me 4481 hits.
> > I have confirmed that this quantity is correct for this field.
> >
> > When I do the following : search.pl q="all:1 AND fieldx:foo", I get a
> > lower quantity of 4449 hits. I've lost 32 documents.
>
> This behavior suggests a bug in either ANDScorer or one of the
> PostingList subclasses.
>
> [snip]
>
> One possibility is that PostingList is reading records incorrectly,
> so that the iterated doc nums don't match what ought to be in that
> array.

I did consider that, particularly when I read about you changing them
to start at 1 (rather than zero), but that change doesn't affect
0.20_05. I added in some debug code to output *my* unique identifier
for each document returned, but that didn't reveal anything more to me.

> [snip]
>
> The second possibility is that PostingList is fine, but ANDScorer is
> performing the intersection improperly.
>
> > Why would the 3 searches not yield the same results?
>
> They should.
>
> There are two stages of compilation for that particular query string:
> QueryParser produces a BooleanQuery, and BooleanQuery produces a
> BooleanScorer wrapping an ANDScorer. ANDScorer operates on an array
> of subscorers (in this case there would be two TermScorers in the
> array), and the order in which the subscorers are arranged matters in
> terms of how the intersection algorithm plays out.
>
> My intuition is that, if it's not the deletions issue, then
> ANDScorer_skip_to is to blame. The algo, which is very similar to
> that used by PhraseScorer, is only mildly convoluted, but it happens
> to be hard to write tests for.
>
> If you can supply a failing test case, I will work with that
> directly. Otherwise, I'll attempt to improve testing for ANDScorer
> and hope that the bug shows itself.

I'll try to do that. I have already tried to rebuild the index so that
it only contains the 2 fields mentioned, and 4481 records, but the
results from that index are correct.

I'll strip out the irrelevant code/data and send my data and test case
to you off-list once I've got a refined example.

Thanks,

Adam

Re: I'm getting fewer than expected results when supplying multiple fields
On Nov 10, 2007, at 3:23 AM, Adam . wrote:

> I'll strip out the irrelevant code/data and send my data and test case
> to you off-list once I've got a refined example.

I've isolated the problem and can provide a workaround. It turns out
not to be ANDScorer after all. ANDScorer has held up well under more
thorough testing.

Instead, the problem manifests during SegPList_Skip_To. It either
occurs because of something awry in the SegPList_skip_to function
itself, or because either PostingsWriter or LexWriter isn't writing
skip information correctly.

SegPList_Skip_To is only an optimization though. If we disable it,
then SegPostingList inherits the definitive method from Scorer...

bool_t
Scorer_skip_to(Scorer *self, u32_t target)
{
    do {
        if ( !Scorer_Next(self) )
            return false;
    } while ( target > Scorer_Doc(self) );

    return true;
}

... which produces the correct results.
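
To see why the linear method is a safe fallback, the sketch below drives
a two-way AND with a skip_to of the same shape as Scorer_skip_to above.
ArrayPList, the aplist_* functions and and_count are hypothetical names
working over in-memory arrays, not KinoSearch types, and the leapfrog
loop is only an assumed simplification of what ANDScorer does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a posting list: a cursor over an ascending
 * array of doc numbers. */
typedef struct {
    const uint32_t *docs;
    size_t          len;
    size_t          used;   /* number of entries consumed so far */
} ArrayPList;

/* Advance to the next doc; false once the list is exhausted. */
bool
aplist_next(ArrayPList *p)
{
    if (p->used >= p->len)
        return false;
    p->used++;
    return true;
}

/* Current doc number; only valid after a successful aplist_next. */
uint32_t
aplist_doc(const ArrayPList *p)
{
    assert(p->used > 0);
    return p->docs[p->used - 1];
}

/* Linear skip: advance at least once, then until doc >= target. */
bool
aplist_skip_to(ArrayPList *p, uint32_t target)
{
    do {
        if (!aplist_next(p))
            return false;
    } while (target > aplist_doc(p));
    return true;
}

/* Count docs present in both lists by leapfrogging the two cursors. */
size_t
and_count(ArrayPList *first, ArrayPList *second)
{
    size_t hits = 0;

    if (!aplist_next(first) || !aplist_next(second))
        return 0;

    for (;;) {
        uint32_t a = aplist_doc(first);
        uint32_t b = aplist_doc(second);

        if (a == b) {
            hits++;
            if (!aplist_next(first) || !aplist_next(second))
                break;
        }
        else if (a < b) {
            if (!aplist_skip_to(first, b))
                break;
        }
        else {
            if (!aplist_skip_to(second, a))
                break;
        }
    }
    return hits;
}
```

With a skip_to that never overshoots, and_count returns the same total
whichever list is passed as first, which is the behaviour expected of
'all:1 AND fieldx:foo' versus 'fieldx:foo AND all:1'.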

To implement the workaround, comment out the declaration of Skip_To
in SegPostingList.h...

/*
chy_bool_t
kino_SegPList_skip_to(kino_SegPostingList *self, chy_u32_t target);
KINO_METHOD("Kino_SegPList_Skip_To");
*/

... then run this sequence:

./Build distclean
perl Build.PL
./Build [test, install, code, whatever]

Note: the distclean step is *essential*.

The main consequence of disabling Skip_To is that intersections which
contain at least one rare term will proceed more slowly.
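
A rough way to picture that cost is to count how many postings each
approach touches before reaching a target doc. The sketch below is an
illustration only: it assumes a skip point every SKIP_INTERVAL postings
in an in-memory array, which is a made-up layout and not KinoSearch's
on-disk skip format, and linear_seek and skipping_seek are hypothetical
names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKIP_INTERVAL 16    /* assumed spacing between skip points */

/* Scan forward from start to the first doc >= target, adding one to
 * *reads per posting examined. Returns its index, or len if none. */
size_t
linear_seek(const uint32_t *docs, size_t len, size_t start,
            uint32_t target, size_t *reads)
{
    size_t i = start;

    assert(reads != NULL);

    while (i < len) {
        (*reads)++;
        if (docs[i] >= target)
            break;
        i++;
    }
    return i;
}

/* Same result as linear_seek, but hop a whole interval at a time while
 * the posting at the next skip point is still below the target. */
size_t
skipping_seek(const uint32_t *docs, size_t len, size_t start,
              uint32_t target, size_t *reads)
{
    size_t i = start;

    assert(reads != NULL);

    while (i + SKIP_INTERVAL < len) {
        (*reads)++;                         /* consult one skip point */
        if (docs[i + SKIP_INTERVAL] >= target)
            break;
        i += SKIP_INTERVAL;
    }
    /* Finish with a short linear scan inside the final interval. */
    return linear_seek(docs, len, i, target, reads);
}
```

When the other term in the intersection is rare, each seek has a distant
target, so the linear version examines nearly every posting of the
common term while the skipping version examines far fewer.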

The investigation of this bug has produced some happy side effects.

* The newly introduced MockScorer class has made it possible to write
  much more robust tests for ANDScorer; I'll soon be adding similarly
  robust tests for ORScorer, ANDORScorer, and ANDNOTScorer.
* Even better, I've more-or-less solved the problem of how to override
  C methods with Perl methods, making it possible to implement
  MockScorer entirely in Perl. The same technique can be applied to
  other classes, making it possible, for instance, to write a
  HitCollector with a Perl collect() method.

The next step is to figure out what's causing SegPList_Skip_To to
misbehave. Happily, even if the problem is a write-time bug, skip
information is completely recoverable, so it won't be necessary to
regenerate indexes from scratch.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: I'm getting fewer than expected results when supplying multiple fields
On (15/11/07 15:52), Marvin Humphrey wrote:
> On Nov 10, 2007, at 3:23 AM, Adam . wrote:
>
> >I'll strip out the irrelevant code/data and send my data and test case
> >to you off-list once I've got a refined example.
>
> I've isolated the problem and can provide a workaround. It turns out
> not to be ANDScorer after all. ANDScorer has held up well under more
> thorough testing.
>
> Instead, the problem manifests during SegPList_Skip_To. It either
> occurs because of something awry in the SegPList_skip_to function
> itself, or because either PostingsWriter or LexWriter isn't writing
> skip information correctly.
>
> SegPList_Skip_To is only an optimization though. If we disable it,
> then SegPostingList inherits the definitive method from Scorer...
>
> bool_t
> Scorer_skip_to(Scorer *self, u32_t target)
> {
> do {
> if ( !Scorer_Next(self) )
> return false;
> } while ( target > Scorer_Doc(self) );
>
> return true;
> }
>
> ... which produces the correct results.
>
> To implement the workaround, comment out the declaration of Skip_To
> in SegPostingList.h...
>
> /*
> chy_bool_t
> kino_SegPList_skip_to(kino_SegPostingList *self, chy_u32_t target);
> KINO_METHOD("Kino_SegPList_Skip_To");
> */
>
> ... then run this sequence:
>
> ./Build distclean
> perl Build.PL
> ./Build [test, install, code, whatever]
>
> Note: the distclean step is *essential*.

I have commented out the above code, but got this fatal error when doing the
./Build install step :

In file included from c_src/KinoSearch/Index/SegPostingList.c:5:
c_src/r/KinoSearch/Index/SegPostingList.r:200: error: 'kino_SegPList_skip_to' undeclared here (not in a function)
error building c_src/KinoSearch/Index/SegPostingList.o from 'c_src/KinoSearch/Index/SegPostingList.c' at /usr/local/share/perl/5.8.8/ExtUtils/CBuilder/Base.pm line 108.

I commented out line 200 of SegPostingList.r :

(kino_PList_skip_to_t)kino_SegPList_skip_to,

and ran the above 3 build steps again. This time it completed successfully.

I rebuilt my index again which went fine, but when I try my search
script, I get the following error :

Can't locate object method "collect" via package "KinoSearch::Util::VirtualTable" at
/usr/local/lib/perl/5.8.8/KinoSearch/Searcher.pm line 123.

Can you advise?

> The investigation of this bug has produced some happy side effects.

A beneficial bug? I must remember to use that spin at work. :-)

Thanks,

Adam

Re: I'm getting fewer than expected results when supplying multiple fields
On Nov 16, 2007, at 6:22 AM, Adam wrote:

> I have commented out the above code, but got this fatal error when
> doing the
> ./Build install step :
>
> In file included from c_src/KinoSearch/Index/SegPostingList.c:5:
> c_src/r/KinoSearch/Index/SegPostingList.r:200: error:
> 'kino_SegPList_skip_to' undeclared here (not in a function)
> error building c_src/KinoSearch/Index/SegPostingList.o from
> 'c_src/KinoSearch/Index/SegPostingList.c' at /usr/local/share/perl/
> 5.8.8/ExtUtils/CBuilder/Base.pm line 108.
>
> I commented out line 200 of SegPostingList.r :
>
> (kino_PList_skip_to_t)kino_SegPList_skip_to,

My bad. The point of the "./Build distclean" step is to trigger the
regeneration of all the .r files from scratch. However, I'd
forgotten that it doesn't do that from an unpacked distro tarball,
only from an svn checkout.

I've set up a temporary branch in svn to handle this bug, since I
don't want to make a quick release from svn trunk:

svn co http://www.rectangular.com/svn/kinosearch/branches/dev-0.20_05x ks_plist_fix

The SegPList_Skip_To disabling has been committed on that branch.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: I'm getting fewer than expected results when supplying multiple fields
Hi Marvin,

On (16/11/07 12:41), Marvin Humphrey wrote:
> On Nov 16, 2007, at 6:22 AM, Adam wrote:
>
> I've set up a temporary branch in svn to handle this bug, since I
> don't want to make a quick release from svn trunk:
>
> svn co http://www.rectangular.com/svn/kinosearch/branches/dev-0.20_05x ks_plist_fix
>
> The SegPList_Skip_To disabling has been committed on that branch.

Excellent, I applied this patch on Monday and it has held up under
extensive testing, without any noticeable loss in performance.

Thank you,

Adam
