Mailing List Archive

PolyFilter
Marvin, you offered this code:

my $polyfilter = KinoSearch::Search::PolyFilter->new;
$polyfilter->add( filter => $query_filter, logic => 'AND' );
$polyfilter->add( filter => $range_filter, logic => 'AND' );
my $hits = $searcher->search( query => $query, filter => $polyfilter );

I was wondering if occur => 'MUST' might make sense here instead of logic
=> 'AND', to keep similar to BooleanQuery.


Also, I can start on the base Filter class if you like.
KinoSearch::Search::Filter?

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
PolyFilter [ In reply to ]
On Mar 29, 2007, at 3:18 PM, Chris Nandor wrote:

> Marvin, you offered this code:
>
> my $polyfilter = KinoSearch::Search::PolyFilter->new;
> $polyfilter->add( filter => $query_filter, logic => 'AND' );
> $polyfilter->add( filter => $range_filter, logic => 'AND' );
> my $hits = $searcher->search( query => $query, filter =>
> $polyfilter );
>
> I was wondering if occur => 'MUST' might make sense here instead of
> logic
> => 'AND', to keep similar to BooleanQuery.

Neat idea.

Let's test drive that with some pseudo-code...

my $query_filter = QueryFilter->new("category:bowling");
my $range_filter = RangeFilter->new("between 1900 and 2000");

$polyfilter = PolyFilter->new;
$polyfilter->add( filter => $query_filter, occur => 'MUST' );
$polyfilter->add( filter => $range_filter, occur => 'MUST' );

Looks good. That's clearly "bowling between 1900 and 2000".

my $lowball = RangeFilter->new("between 0 and 10 dollars");
my $good_reps = RangeFilter->new("feedback rating above 100");

$polyfilter = PolyFilter->new;
$polyfilter->add( filter => $lowball, occur => 'MUST_NOT' );
$polyfilter->add( filter => $good_reps, occur => 'MUST' );

That also looks good. "No lowball offers and only good feedback
ratings".

But here's a wrinkle: Filters are black/white. "SHOULD" is sort of a
weird concept in that context. Say we'd prefer buyers from
California, but don't want to exclude buyers from other states.

my $cali = QueryFilter->new("state:California");

# hmm...
$polyfilter->add( filter => $cali, occur => 'SHOULD' );

That doesn't really do what we want. We can't OR that filter's bit
set together that with the prior two filters, or we'll get lowballers
with lousy feedback from California.

Preferring cali buyers can be accomplished by putting that in the
Query rather than the Filter, but we still have the problem of how to
express that we want the polyfilter to be the union of its component
filters.

# ????? Not so good.
$polyfilter = PolyFilter->new;
$polyfilter->add( filter => $cali, occur => 'MAY' );
$polyfilter->add( filter => $oregon, occur => 'MAY' );

There's also no easy way to express XOR using variants of SHOULD,
MUST and friends. That won't be a common need, but it's easy to
implement and I don't see any good reason to make it more clumsy than
it needs to be or to arbitrarily exclude it.

Thoughts?

FWIW, BooleanQuery's interface is less than ideal. It isn't really
about boolean logic. It's about 'required' and 'prohibited' clauses,
with which you can fake up boolean logic.

> Also, I can start on the base Filter class if you like.
> KinoSearch::Search::Filter?

'Zactly.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
PolyFilter [ In reply to ]
At 16:07 -0700 2007.03.29, Marvin Humphrey wrote:
>Thoughts?

Not especially. Let's stick with the standard booleans, AND/OR/NOT/XOR. I
just like consistency, and I didn't realize that SHOULD was not a straight
OR.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
PolyFilter [ In reply to ]
At 15:18 -0700 2007.03.29, Chris Nandor wrote:
> $polyfilter->add( filter => $query_filter, logic => 'AND' );
> $polyfilter->add( filter => $range_filter, logic => 'AND' );
> my $hits = $searcher->search( query => $query, filter => $polyfilter );

Hm. In order to get the bits out of the filter, we need a reader.

So unless a reader is passed, I don't see a good way to do this, unless we
delay the AND/OR'ing of the bits until PolyFilter::bits() is called. Which
actually might work fine.

What do you think?

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
PolyFilter [ In reply to ]
On Mar 29, 2007, at 8:48 PM, Chris Nandor wrote:

> At 15:18 -0700 2007.03.29, Chris Nandor wrote:
>> $polyfilter->add( filter => $query_filter, logic => 'AND' );
>> $polyfilter->add( filter => $range_filter, logic => 'AND' );
>> my $hits = $searcher->search( query => $query, filter =>
>> $polyfilter );
>
> Hm. In order to get the bits out of the filter, we need a reader.
>
> So unless a reader is passed, I don't see a good way to do this,
> unless we
> delay the AND/OR'ing of the bits until PolyFilter::bits() is
> called. Which
> actually might work fine.
>
> What do you think?

I agree; that's the right way to handle it.

FYI, BitVector has now been outfitted with AND, OR, XOR, and AND_NOT.

(I think your choice of 'NOT' is better than AND_NOT for PolyFilter's
API.)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
PolyFilter [ In reply to ]
At 21:10 -0700 2007.03.29, Marvin Humphrey wrote:
>I agree; that's the right way to handle it.

Good, because I just finished coding it up!

>FYI, BitVector has now been outfitted with AND, OR, XOR, and AND_NOT.

Yep, I've been testing it as they've been added. :)

>(I think your choice of 'NOT' is better than AND_NOT for PolyFilter's
>API.)

OK.

So, I've included the two .pm files. If it looks good to you, I'll do up
some tests tomorrow.

I assume you would like to see a 516-poly_filter.t?

I believe make_collector() would be same for PolyFilter and QueryFilter, so
I put that in the parent class. I am sure you will correct me if I am
wrong.

I also made 'AND' the default logical op, so logic => 'AND' is optional.
You may dislike this, and if so, I'll make it required. The thing that is
kinda weird to me is having a logical op for the first filter added, when
there is nothing to operate against.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PolyFilter.pm
Type: application/octet-stream
Size: 2770 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070329/75c79cd5/PolyFilter.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Filter.pm
Type: application/octet-stream
Size: 559 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070329/75c79cd5/Filter.obj
PolyFilter [ In reply to ]
On Mar 29, 2007, at 9:29 PM, Chris Nandor wrote:

> So, I've included the two .pm files.

Yeeha! I'm a little too beat to give this a thorough review tonight,
but they sho' looks dandy on first gander!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
PolyFilter [ In reply to ]
At 21:40 -0700 2007.03.29, Marvin Humphrey wrote:
>On Mar 29, 2007, at 9:29 PM, Chris Nandor wrote:
>
>> So, I've included the two .pm files.
>
>Yeeha! I'm a little too beat to give this a thorough review tonight,
>but they sho' looks dandy on first gander!

I made one big mistake:

if (!defined $cached_bits) {
$cached_bits = $bits;

That should obviously be:

if (!defined $cached_bits) {
$cached_bits = $bits->clone;

Oops. The tests revealed the error, though it took me a little while to
track it down. Heh.

Attached is the first run at a test file. Feel free to suggest
modifications to the tests and the rest. I'll be out most of tomorrow,
but I'll try to get any changes you suggest finished and sent back to you
before Saturday.

Thanks,

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 516-poly_filter.t
Type: application/octet-stream
Size: 3021 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070330/8eeb0392/516-poly_filter-0001.obj
PolyFilter [ In reply to ]
On Mar 29, 2007, at 9:29 PM, Chris Nandor wrote:

> I believe make_collector() would be same for PolyFilter and
> QueryFilter, so
> I put that in the parent class. I am sure you will correct me if I am
> wrong.

Nice work. It's not a public method, so we have the freedom to
refactor if necessary -- but this seems like a good way of doing things.

I've added this to Filter.pm...

sub bits { shift->abstract_death }

... along with some documentation. I try to have every method in KS
have a summary comment at its origin. Admittedly, that origin is not
always easy to find, as it might be in C, XS, or Perl.

> I also made 'AND' the default logical op, so logic => 'AND' is
> optional.
> You may dislike this, and if so, I'll make it required. The thing
> that is
> kinda weird to me is having a logical op for the first filter
> added, when
> there is nothing to operate against.

I concur. Good call.

I'm grateful that the code you supplied is so consistent with the
rest of the KS code base. Verification of arguments to add(), POD
structure... it's obvious you made a conscious effort.

Regarding this...

# XXX what to do here?
confess("No filters") unless @{ $self->{filters} };

I think we should supply an all-pass filter if PolyFilter->add never
gets called. Perhaps the best way to do this is to spec that Filter-
>bits may return undef and that Filter->make_collector may return
the original collector. That would be the cheapest, CPU and memory-
wise, and I don't think the acrobatics required by the code in the
Filter classes will be too strenuous. What do you think?

I had two minor doco quibbles. First, I see why you made the
comparison to BooleanQuery, but the way filters are combined in
PolyFilter differs subtly. Pseudocode:

# returns docs with bar and not foo
$bool_query->NOT('foo');
$bool_query->MUST('bar');

# filters out everything
$polyfilter->NOT('foo');
$polyfilter->MUST('bar');

Therefore, I've changed the description to omit any mention of
BooleanQuery. (I think we'll actually need to fix PolyFilter's
behavior for when the first call to add() specs 'NOT').

Second, I changed this...

B<filter> - the filter object to add to the PolyFilter, which
currently must be a L<QueryFilter|KinoSearch::Search::QueryFilter>
or L<RangeFilter|KinoSearch::Search::RangeFilter> object, or another
PolyFilter object.

... to this:

B<filter> - the <Filter|KinoSearch::Search::Filter> object to add
to the
PolyFilter, which might be a L<QueryFilter|
KinoSearch::Search::QueryFilter>,
a L<RangeFilter|KinoSearch::Search::RangeFilter>, or another
PolyFilter.

By generalizing and leaving the list of possible classes open-ended,
we inoculate ourselves against the possibility that it will go out of
date if we add another Filter subclass. Though it's not always
practical, when we can we should avoid writing documentation that can
be rendered invalid by non-local changes.

All in all, these module files are very nice additions to KS, and I'm
pleased to have added them along with your test file as revision
2241. Thanks!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
PolyFilter [ In reply to ]
On Mar 30, 2007, at 12:17 AM, Chris Nandor wrote:
> Attached is the first run at a test file. Feel free to suggest
> modifications to the tests and the rest.

I brought one comment up to date, then applied. Thanks!

Here are a few more scenarios that need tests:

* No calls to add().
* Single call to add, once for each possible value of 'logic'.
* A lengthy filter chain.

The last should probably just be a single test, rather than a
comprehensive exercise of all possibilities. I anticipate that it
will pass. :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
PolyFilter [ In reply to ]
At 12:00 -0700 2007.03.31, Marvin Humphrey wrote:
>I've added this to Filter.pm...
>
> sub bits { shift->abstract_death }

Yeah, that's good; I figured you'd want something like it, but wasn't quite
sure.


>I'm grateful that the code you supplied is so consistent with the
>rest of the KS code base. Verification of arguments to add(), POD
>structure... it's obvious you made a conscious effort.

I've managed various projects before, so I understand. :)


>I think we should supply an all-pass filter if PolyFilter->add never
>gets called. Perhaps the best way to do this is to spec that Filter-
> >bits may return undef and that Filter->make_collector may return
>the original collector. That would be the cheapest, CPU and memory-
>wise, and I don't think the acrobatics required by the code in the
>Filter classes will be too strenuous. What do you think?

Sounds good at first thought. I'll look into the actual code later and see
for sure.


>Therefore, I've changed the description to omit any mention of
>BooleanQuery.

Nod.


>(I think we'll actually need to fix PolyFilter's
>behavior for when the first call to add() specs 'NOT').

Hm, yeah. I'll work on that too.


>By generalizing and leaving the list of possible classes open-ended,
>we inoculate ourselves against the possibility that it will go out of
>date if we add another Filter subclass.

Nod.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
PolyFilter [ In reply to ]
At 12:10 -0700 2007.03.31, Marvin Humphrey wrote:
>Here are a few more scenarios that need tests:

Yeah, I just wanted to get something started, and am grateful for your
suggestions. I'll add those.


> * A lengthy filter chain.

Lengthy as in, just a lot of filters added to one PolyFilter?

And just in case you didn't notice, the test included already has a
PolyFilter added to a PolyFilter, which I was very happy to see Just Work.
:)

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
PolyFilter [ In reply to ]
On Mar 31, 2007, at 1:40 PM, Chris Nandor wrote:

> Lengthy as in, just a lot of filters added to one PolyFilter?

That's what I was thinking, but y'know, maybe it's not needed. We
know that two in series works fine. And from a "glass box" testing
perspective, it looks unlikely to fail with more.

> And just in case you didn't notice, the test included already has a
> PolyFilter added to a PolyFilter, which I was very happy to see
> Just Work.
> :)

Yeah. Rawk.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
PolyFilter [ In reply to ]
I think I got all of your suggestions in this patch:

* PolyFilter->bits() returns undef if add() not called, make_collector()
returns $inner_coll if bits() returns undef

* poly_filter.t tests no calls to add(), and each possibility of single
calls to add()

* Behavior is different if first call to add() has logic => 'NOT'

For the latter, I added a new C method BitVec_invert, which inverts every
bit, which has the effect of a single-input logical NOT.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-20070401.txt
Type: application/octet-stream
Size: 6469 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070331/71702608/patch-20070401.obj
PolyFilter [ In reply to ]
On Mar 31, 2007, at 11:29 PM, Chris Nandor wrote:

> I think I got all of your suggestions in this patch:
>
> * PolyFilter->bits() returns undef if add() not called,
> make_collector()
> returns $inner_coll if bits() returns undef
>
> * poly_filter.t tests no calls to add(), and each possibility of
> single
> calls to add()

I found a glitch in this test group: AND was being tested three
times, instead of AND, OR, and XOR. No bugs revealed in the module
code once the test glitch was fixed, though. :)

> * Behavior is different if first call to add() has logic => 'NOT'

I've applied your patch, with the small mod above, as revision 2249.
I believe we can now declare PolyFilter functional. Thank you, and
hooray!

> For the latter, I added a new C method BitVec_invert, which inverts
> every
> bit, which has the effect of a single-input logical NOT.

A big ol' grin stole across my face when I read that, and it got
wider as I perused the patch itself. You nailed every aspect of the
integration (though BitVec_Invert is problematic for subtle
reasons). I'm curious how challenging you found it to grok how to
add your C/XS code.

The thing about BitVec_Invert as implemented in the patch, though, is
that it can flip some bits we don't want flipped. Here's a failing
test I wrote for it:

$bit_vec = KinoSearch::Util::BitVector->new( capacity => 5 );
$bit_vec->set( 1 .. 4 );
$bit_vec->invert; # flips bits 0 .. 7
$bit_vec->set(10); # cap now 11, exposing previously hidden bits
ok( !$bit_vec->get(7),
"invert doesn't set bits outside of valid range" );

BitVector's "cap" variable is an internal number for keeping track of
allocations. It's just there to make sure we never access invalid
memory. It's not the logical size of the bitVector.

In the test above, Invert() flips bits beyond cap -- bits 5, 6, and 7
-- because it does byte-at-a-time bit-flipping. You don't notice
that right away, but after the BitVector grows (which can happen for
a variety of reasons, including BitVec->set with a number greater
than the current cap, as above), those messed up bits get exposed.

That flaw wasn't affecting PolyFilter at present, but that was only
because those cached BitVector objects coincidentally never grow.
The bug needed to be fixed lest it induce some bizarre buggy behavior
somewhere down the line.

The solution was to refactor BitVec_Invert into BitVec_Flip_Range,
and have it take two arguments: from_bit and to_bit. The interface
is roughly based upon java.util.BitSet, as is much of the rest of
BitVector's API.

Flip_Range() is more appropriate because a BitVector doesn't really
have an explicit length, like a Perl array does. Logically, it's a
set of numbers, as determined by the indexes of all bits which are
1. You can't invert "all the elements", because "all the elements"
means "all positive integers", if it means anything at all.

Similar problems were actually afflicting OR, XOR, and BitVec_Count,
but they would have arisen under even more subtle circumstances. I
believe I have everything sorted now, as of commit 2252; I'm glad
that BitVec_Invert brought the problems to my attention.

This was somewhat all somewhat more laborious to work out than I'd
thought it would be; sorry about the delay in responding to your patch.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
PolyFilter [ In reply to ]
At 14:38 -0700 2007.04.02, Marvin Humphrey wrote:
>I found a glitch in this test group: AND was being tested three
>times, instead of AND, OR, and XOR. No bugs revealed in the module
>code once the test glitch was fixed, though. :)

Yeah. I had them correct previously, but was copying/pasting the code when
cleaning it up, and didn't fix those. Thanks for the catch. :)


>> * Behavior is different if first call to add() has logic => 'NOT'
>
>I've applied your patch, with the small mod above, as revision 2249.
>I believe we can now declare PolyFilter functional. Thank you, and
>hooray!

My pleasure, and thanks for the help in figuring it all out.


>> For the latter, I added a new C method BitVec_invert, which inverts
>> every
>> bit, which has the effect of a single-input logical NOT.
>
>A big ol' grin stole across my face when I read that, and it got
>wider as I perused the patch itself. You nailed every aspect of the
>integration (though BitVec_Invert is problematic for subtle
>reasons). I'm curious how challenging you found it to grok how to
>add your C/XS code.

It wasn't that bad. Took me a minute or two to track down where to make
the changes. I've done a bit of XS in my day, though I don't know C so
well (especially bit-twiddling), so I spent quite a bit more time working
on the C than figuring out how to add it, which is probably backward from
most XS hackers. :)


Thanks for cleaning up my patch, looks all good from here.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
PolyFilter [ In reply to ]
At 14:38 -0700 2007.04.02, Marvin Humphrey wrote:
>The solution was to refactor BitVec_Invert into BitVec_Flip_Range,
>and have it take two arguments: from_bit and to_bit. The interface
>is roughly based upon java.util.BitSet, as is much of the rest of
>BitVector's API.

There's something not quite right in the new code, looks like an
off-by-one. I found it randomly when using shuffle() on 'a'..'z'. If 'a'
was the 25th letter, and I was setting the bits to look for 'a', 'a' would
remain turned on when flip_range() was called. Here's a smaller test case:

use KinoSearch::Util::BitVector;
my $bit_vec = KinoSearch::Util::BitVector->new;
$bit_vec->flip_range( 0, 10 );
print "oops: $_\n" for grep { !$bit_vec->get($_) } 0..9;

Returns:

oops: 8

Looks to me like the problem is here:

/* flip partial bytes */
while (last % 8 != 0 && last >= first) {
BitVec_flip(self, last);
last--;
}

We are counting from 0, so 7 is the 8th bit, but this stops counting
backward at 8, so the 8th bit should be handled here, but it is skipped.
Here's a late-night and probably subtly incorrect stab at a patch, that
makes all the tests work (including some new ones).

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-20070404.txt
Type: application/octet-stream
Size: 2224 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070404/e3ea3bf2/patch-20070404-0001.obj
PolyFilter [ In reply to ]
On Apr 4, 2007, at 1:07 AM, Chris Nandor wrote:

> There's something not quite right in the new code, looks like an
> off-by-one.

Thanks for the test and the solution. Your test provided the
template for this:

for my $lower ( 0 .. 17 ) {
for my $amount ( 0 .. 17 ) {
my $upper = $lower + $amount;
$bit_vec = KinoSearch::Util::BitVector->new;
$bit_vec->flip_range( $lower, $upper + 1 );
is_deeply( $bit_vec->to_arrayref, [ $lower .. $upper ],
"flip range $lower .. $upper" );
}
}

I think that should snag all the corner cases for Flip_Range. It
would be nice to have similar tests for all aspects of BitVector.
Maybe AND, OR, and XOR tests which combine sets with incrementing
starts, ends, and sizes, but with the addition of steps of two and
three, so we aren't always testing contiguous sets. However, I'm not
going to make that a priority. Hopefully we're done tweaking for now.

I did wind up implementing the change slightly differently, though
your code worked fine. The mistake arose because the progression
through the function wasn't logical enough. I think it's a little
clearer now:

void
BitVec_flip_range(BitVector *self, u32_t from_bit, u32_t to_bit)
{
u32_t first = from_bit;
u32_t last = to_bit - 1;

/* proceed only if we have bits to flip */
if (from_bit == to_bit)
return;

BITVEC_GROW(self, last);

/* flip partial bytes */
while (last % 8 != 0 && last > first) {
BitVec_flip(self, last);
last--;
}
while (first % 8 != 0 && first < last) {
BitVec_flip(self, first);
first++;
}

/* are first and last equal ? */
if (first == last) {
/* there's only one bit left to flip */
BitVec_flip(self, last);
}
/* they must be multiples of 8, then */
else {
const u32_t start_tick = first >> 3;
const u32_t limit_tick = last >> 3;
u8_t *bits = self->bits + start_tick;
u8_t *limit = self->bits + limit_tick;

/* last actually belongs to the following byte (e.g. 8, in
byte 2) */
BitVec_flip(self, last);

/* flip whole bytes */
while (bits < limit) {
*bits = ~(*bits);
bits++;
}
}
}

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/