Mailing List Archive: RangeFilter false positives

RangeFilter false positives

projects at pudge

Apr 9, 2007, 8:59 PM

Post #1 of 15 (4115 views)

If I have no documents within a RangeFilter's range, then the filter
returns all (otherwise matching) documents, or just a couple of matches.

Playing with 512-range_filter.t, I get various unexpected results.

my @letters = 3 .. 7;

my $results = test_filtered_search(
field => 'name',
lower_term => 1,
upper_term => 2,
include_upper => 1,
include_lower => 1,
);

I expect that to give no results, but it gives that $letters[0] and
$letters[1] (3, and 4) match. If I set include_upper to 0, I get the
expected results (none). Changing include_lower has no effect. This seems
quite wrong to me.

However, if instead of:

lower_term => 1,
upper_term => 2,

I have:

lower_term => 8,
upper_term => 9,

(that is, the bounds are *greater* than the existing document values), then
*all* the results are returned (regardless of the values of include_*).

I believe both behaviors (the include_upper thing, and the returning all
results thing) are bugs, but I am not entirely sure. I do know they are
not the behavior I want. :-)

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

RangeFilter false positives [ In reply to ]

projects at pudge

Apr 9, 2007, 10:13 PM

Post #2 of 15 (4062 views)

At 20:47 -0700 2007.04.09, Chris Nandor wrote:
>If I have no documents within a RangeFilter's range, then the filter
>returns all (otherwise matching) documents, or just a couple of matches.

Here's a patch that seems to work, but I am not entirely sure it's right.

However, a new problem appears to have arisen. So I could test ranges
before and after, I changed the @letters from:

my @letters = 'a' .. 'z';

to

my @letters = 'f' .. 't';

And this works fine, until the final tests, where it tests multi-segment
indexes, and then it segfaults on search(). If I change @letters to 'f' ..
'u', then it works (note 'f'..'t' is 15 letters ... just short of two full
bytes).

I hope all this is enough to fix whatever the problem may be.

Thanks,

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-20070409.txt
Type: application/octet-stream
Size: 6595 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070409/cd9572f4/patch-20070409.obj

RangeFilter false positives [ In reply to ]

marvin at rectangular

Apr 11, 2007, 12:31 AM

Post #3 of 15 (4069 views)

On Apr 9, 2007, at 10:01 PM, Chris Nandor wrote:

> At 20:47 -0700 2007.04.09, Chris Nandor wrote:
>> If I have no documents within a RangeFilter's range, then the filter
>> returns all (otherwise matching) documents, or just a couple of
>> matches.
>
> Here's a patch that seems to work, but I am not entirely sure it's
> right.

I've been over this code once before; it's still buggy, and you can't
quite figure out how to fix it... All that suggests to me that a
refactoring pass for the benefit of clarity is needed, like the one
BitVector got recently.

> And this works fine, until the final tests, where it tests multi-
> segment
> indexes, and then it segfaults on search().

I'm nearly done with the BooleanScorer2 port, which is the precursor
to fixing the memory errors in Tally. That's a blocking issue, so
I'm prioritizing it. There's a strong possibility that the segfaults
may in fact arise there rather than in RangeFilter. They're sneaky
little bastages.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

RangeFilter false positives [ In reply to ]

marvin at rectangular

Apr 12, 2007, 6:59 PM

Post #4 of 15 (4060 views)

On Apr 9, 2007, at 10:01 PM, Chris Nandor wrote:

> Here's a patch that seems to work, but I am not entirely sure it's
> right.

I applied your patch verbatim as repository revision 2301, then went
and refactored this part of RangeFilter for clarity as 2302.

It turns out your fix was "wrong" in principle but induced correct
behavior -- it detoured around some bogus code rather than
eliminating the bogus code. The test modifications in your patch
were very helpful in verifying correct behavior during refactoring,
and the way in which your fix was "wrong" suggested a better model to
replace the flawed one you had difficulty grokking. Thanks all around!

Would you be interested in adding a new feature to RangeFilter's
API? If you can successfully modify the new code and feel confident
that you got it right, that would indicate that the refactoring went
well.

Right now, all five constructor arguments are required. However, we
should have an omitted or undef value for lower_term indicate "no
lower bound" and an omitted or undef value for upper_term indicate
"no upper bound".

Similarly, we should enable default false values for include_lower
and include_upper.

> And this works fine, until the final tests, where it tests multi-
> segment
> indexes, and then it segfaults on search().

After completing the BooleanScorer2 port and eliminating the memory
errors from Tally, I found that this segfault persisted. It
originated within MultiTermList_Seek:

==21278== Invalid read of size 4
==21278== at 0x4816F00: kino_TLCache_get_term (TermListCache.c:59)
==21278== by 0x4814D96: kino_MultiTermList_seek (MultiTermList.c:188)
==21278== by 0x47F670E: XS_KinoSearch__Index__TermList_seek
(KinoSearch.xs:2663)
==21278== by 0x80CE25E: Perl_pp_entersub (pp_hot.c:2877)
==21278== by 0x80B1A5F: Perl_runops_debug (dump.c:1459)
==21278== by 0x8062902: S_run_body (perl.c:2361)
==21278== by 0x80624BE: perl_run (perl.c:2283)
==21278== by 0x805EA02: main (perlmain.c:99)
==21278== Address 0x0 is not stack'd, malloc'd or (recently) free'd

The fix was somewhat complex, but important. It was committed as
revision 2300.

Subversion trunk now tests clean under valgrind and is safe to use
again.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

RangeFilter false positives [ In reply to ]

marvin at rectangular

Apr 12, 2007, 7:04 PM

Post #5 of 15 (4052 views)

On Apr 12, 2007, at 6:47 PM, Marvin Humphrey wrote:
> Similarly, we should enable default false values for include_lower
> and include_upper.

Um, make that *true* values. Having "from A to Z" include A and Z by
default is clearly the most intuitive behavior.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

RangeFilter false positives [ In reply to ]

projects at pudge

Apr 12, 2007, 11:54 PM

Post #6 of 15 (4059 views)

At 18:47 -0700 2007.04.12, Marvin Humphrey wrote:
>It turns out your fix was "wrong" in principle but induced correct
>behavior -- it detoured around some bogus code rather than
>eliminating the bogus code.

Yeah, I figured. But I figured it also might help you narrow it all down.
Even when I saw the first commit come through, I thought a fix for it was
probably coming next. :-)

>Would you be interested in adding a new feature to RangeFilter's
>API? If you can successfully modify the new code and feel confident
>that you got it right, that would indicate that the refactoring went
>well.
>
>Right now, all five constructor arguments are required. However, we
>should have an omitted or undef value for lower_term indicate "no
>lower bound" and an omitted or undef value for upper_term indicate
>"no upper bound".
>
>Similarly, we should enable default false values for include_lower
>and include_upper.

Sure ... I am just not entirely sure what the resulting value should be for
"no lower bound" and "no upper bound." Should no lower be -2? 0? And
should no upper be ... num_docs?

Or should this instead be smarter and skip those bounds checks? For
example, return "-3" to mean no bound check, then:

// no upper or lower bound
if (data->lower_bound == -3 && data->upper_bound == -3) {
data->inner_coll->collect(data->inner_coll, doc_num, score);
}
// no lower bound
else if (data->lower_bound == -3 && locus <= data->upper_bound) {
data->inner_coll->collect(data->inner_coll, doc_num, score);
}
// no upper bound
else if (data->upper_bound == -3 && locus >= data->lower_bound) {
data->inner_coll->collect(data->inner_coll, doc_num, score);
}
else if (locus >= data->lower_bound && locus <= data->upper_bound) {
data->inner_coll->collect(data->inner_coll, doc_num, score);
}

>Subversion trunk now tests clean under valgrind and is safe to use
>again.

Awesome.

At 18:52 -0700 2007.04.12, Marvin Humphrey wrote:
>On Apr 12, 2007, at 6:47 PM, Marvin Humphrey wrote:
>> Similarly, we should enable default false values for include_lower
>> and include_upper.
>
>Um, make that *true* values. Having "from A to Z" include A and Z by
>default is clearly the most intuitive behavior.

Nod. I am sometimes a poor judge of what is most intuitive to others, but
that is what I would expect. :-)

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

RangeFilter false positives [ In reply to ]

projects at pudge

Apr 13, 2007, 12:21 AM

Post #7 of 15 (4056 views)

Oh, and here are the patches I have for this. I am not crazy about the -3
thing, but it seems to work, so I included that until I get guidance from
you.

Also, a RangeFilter with no bounds at all seems silly, so not sure if you
want to return an error, or let the caller just make a range filter that
includes everything. For now, such a "silly" filter is explicitly allowed.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-20070413.txt
Type: application/octet-stream
Size: 4421 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070413/e49c231c/patch-20070413.obj

RangeFilter false positives [ In reply to ]

projects at pudge

Apr 13, 2007, 12:27 AM

Post #8 of 15 (4071 views)

Argh. Small error in test. New patch. Bedtime now.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-20070413-2.txt
Type: application/octet-stream
Size: 4396 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070413/81129a21/patch-20070413-2.obj

RangeFilter false positives [ In reply to ]

marvin at rectangular

Apr 13, 2007, 12:29 AM

Post #9 of 15 (4061 views)

On Apr 12, 2007, at 11:42 PM, Chris Nandor wrote:

> Sure ... I am just not entirely sure what the resulting value
> should be for
> "no lower bound" and "no upper bound." Should no lower be -2? 0?
> And
> should no upper be ... num_docs?

No lower should probably be 0.

No upper should be Reader->max_doc.

The range filter uses an IntMap object, which is essentially just an
array of i32_t. Array index corresponds to doc num and value
corresponds to "term number".

Term number is the would-be array index of the term if all the terms
in that field were marshalled into an array. Docs without a term in
that field are assigned -1 by default.

-1 is the lowest possible value that can appear in the array;
therefore assigning -1 or less to lower_bound means that no document
will be excluded by the lower bound test, even those with no value
for the field. Also, assigning -2 to the upper_bound means that no
hits will be collected.

Reader->max_doc works out to 1 greater than the highest possible term
number. RangeFilter can only be used against an unanalyzed field,
which means we have at most one unique term per document. Reader-
>max_doc is 1 greater than the maximum number of documents;
therefore, it is also 1 greater than the maximum number of terms in
an unanalyzed field.

>
> Or should this instead be smarter and skip those bounds checks? For
> example, return "-3" to mean no bound check, then:
>
> // no upper or lower bound
> if (data->lower_bound == -3 && data->upper_bound == -3) {
> data->inner_coll->collect(data->inner_coll, doc_num, score);
> }
> // no lower bound
> else if (data->lower_bound == -3 && locus <= data->upper_bound) {
> data->inner_coll->collect(data->inner_coll, doc_num, score);
> }
> // no upper bound
> else if (data->upper_bound == -3 && locus >= data->lower_bound) {
> data->inner_coll->collect(data->inner_coll, doc_num, score);
> }
> else if (locus >= data->lower_bound && locus <= data-
> >upper_bound) {
> data->inner_coll->collect(data->inner_coll, doc_num, score);
> }

This is an implementation detail, so it's not something I'm gonna
sweat too hard. However, my mild preference is to resolve this in
RangeFilter.pm and keep the HitCollector code, which is a performance-
critical inner loop, as simple as possible. These conditionals
aren't going to make a difference, but as a general rule, I try to
keep the inner loop stuff uncluttered.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

RangeFilter false positives [ In reply to ]

projects at pudge

Apr 13, 2007, 12:33 AM

Post #10 of 15 (4060 views)

OK, new patch coming soon, ignore the others.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

RangeFilter false positives [ In reply to ]

projects at pudge

Apr 13, 2007, 12:39 AM

Post #11 of 15 (4062 views)

Here you go. I think the tests are sufficient.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-20070413-3.txt
Type: application/octet-stream
Size: 3173 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070413/416673b9/patch-20070413-3.obj

RangeFilter false positives [ In reply to ]

marvin at rectangular

Apr 13, 2007, 9:20 AM

Post #12 of 15 (4062 views)

On Apr 13, 2007, at 12:26 AM, Chris Nandor wrote:

> Here you go. I think the tests are sufficient.

Thanks! I applied this, with some changes, as revision 2304.

* Failure to supply at least one of either lower_term
or upper_term throws an exception.
* POD updated to reflect API changes.
* undef values allowed for include_lower and include_upper
(indicating false).

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

RangeFilter false positives [ In reply to ]

mwexler at tias

Apr 13, 2007, 9:27 AM

Post #13 of 15 (4069 views)

Marvin Humphrey wrote:
>
> Thanks! I applied this, with some changes, as revision 2304.
>
>
> * undef values allowed for include_lower and include_upper
> (indicating false).
Didn't you say that by default include_lower and include_upper would be
true. The example you gave was of a range of 'A' through 'Z' and how, by
default, you would expect this to include both A and Z.

RangeFilter false positives [ In reply to ]

projects at pudge

Apr 13, 2007, 9:33 AM

Post #14 of 15 (4069 views)

At 9:16 -0700 2007.04.13, Mike Wexler wrote:
>Didn't you say that by default include_lower and include_upper would be
>true. The example you gave was of a range of 'A' through 'Z' and how, by
>default, you would expect this to include both A and Z.

What he means is that my patch made it so include_lower/include_upper had
to be defined (either by default, or with the supplied value), but he made
it so you can supply an undef value. The default will still be a true
value, if no value is supplied.

This is, of course, reasonable, because why should it throw an exception if
it is false, since false is an allowed value, and the supply of the value
is optional? I was thinking people would do include_lower => 0, but
there's no reason someone shouldn't be allowed to do include_lower => undef.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

RangeFilter false positives [ In reply to ]

marvin at rectangular

Apr 13, 2007, 10:20 AM

Post #15 of 15 (4066 views)

On Apr 13, 2007, at 9:20 AM, Chris Nandor wrote:

> What he means is

--->8 SNIP 8<---

Exactly. It's a pretty minor point, but consistent with all the
other parameter verification throughout KS. Checks for undef are
generally reserved for verifying whether someone has failed to supply
a required parameter.

Keeping this one (it was there before -- Chris simply left it in)
once a default was established seemed odd. Especially so because the
error message, which you could only have triggered by specifically
supplying an undef, would have upbraided you for failing to supply a
value.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/