Mailing List Archive

1 2  View All
_02: RangeFilter problems [ In reply to ]
On Mar 27, 2007, at 1:25 PM, Chris Nandor wrote:

> OK, sounds good re: tests. I think I basically have this all set.
> Not including the tests, here is what I have.

Thanks! Applied as revision 2221.

> It is all working fine, though RangeFilter->bits() is incomplete
> because of MatchFieldQuery.pm. I'll send the tests along later
> tonight, probably.

OK, I'll go work on MatchFieldQuery.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_02: RangeFilter problems [ In reply to ]
On Mar 27, 2007, at 1:19 PM, Marvin Humphrey wrote:

>> Is it sufficient to run search() a few times and compare the
>> BitVector objects?
>
> Yeah, that'll work if the readers produce distinct result sets.
>
> Psuedocode...
>
> my $query_filter = QueryFilter->new($query);
>
> my $reader_a = IndexReader->open($invindex_a);
> my $reader_b = IndexReader->open($invindex_b);
>
> $searcher = Searcher->new( $reader_a );
> verify_results( $query_filter, $searcher, \@results_a );
> $searcher = Searcher->new( $reader_b );
> verify_results( $query_filter, $searcher, \@results_b );
> $searcher = Searcher->new( $reader_a );
> verify_results( $query_filter, $searcher, \@results_a );

This is probably better... simpler, and fewer variables in play...

my $query_filter = QueryFilter->new($query);

my $reader_a = IndexReader->open($invindex_a);
my $reader_b = IndexReader->open($invindex_b);

my $bits_a = $query_filter->bits($reader_a);
my $bits_b = $query_filter->bits($reader_b);
my $bits_a_copy = $query_filter->bits($reader_a);
my $bits_b_copy = $query_filter->bits($reader_b);

ok( $bits_a->equals($bits_a_copy), "cache by reader 1" );
ok( $bits_b->equals($bits_b_copy), "cache by reader 2" );
ok( !$bits_a->equals($bits_b), "cache by reader 3" );

The default implementation of equals(), inherited by BitVector from
Obj, compares the pointer values for the objects themselves.

bool_t
Obj_equals(Obj *self, Obj *other)
{
return (self == other);
}


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_02: RangeFilter problems [ In reply to ]
At 16:24 -0700 2007.03.27, Marvin Humphrey wrote:
>This is probably better... simpler, and fewer variables in play...

Yeah, that is more along the lines of what I was working on. My hacky test
while working on the code was just to compare the BitVector objects as
stringifed scalars, but this is better. :-)

Here's a new test file for you.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 507-query_filter.t
Type: application/octet-stream
Size: 3334 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070327/0a55ec26/507-query_filter-0001.obj
_02: RangeFilter problems [ In reply to ]
Oops, I forgot to make it use the bits() method when I was cleaning it up
... here you go.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 507-query_filter.t
Type: application/octet-stream
Size: 3137 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070327/de941626/507-query_filter.obj
_02: RangeFilter problems [ In reply to ]
On Mar 27, 2007, at 7:31 PM, Chris Nandor wrote:

> ... here you go.

Very thorough! Looks like you've made up the test so that it guards
against incorrect caching regardless of whether we cache in the
IndexReader object or the QueryFilter object.

I think it might actually go a little farther than it needs to.
QueryFilter objects that are created with identical queries will
always return identical bit sets when supplied with a given reader.
So I don't think we need to test to make sure that distinct
QueryFilters return distinct BitVector objects.

The only danger would be that two different QueryFilters constructed
with different queries give you the wrong cached bits -- so you
filter out "bar" when you meant to filter out "foo". That would be
bad, but it's also far-fetched. I think we might consider
simplifying the tests so that there are fewer dimensions and the test
file becomes a little easier to follow.

What would you say to cutting down the number of QueryFilter objects
from 2 to 1? It would still be a marked improvement over the one,
lonely test that file had before. :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_02: RangeFilter problems [ In reply to ]
On Mar 27, 2007, at 2:14 PM, Marvin Humphrey wrote:

> OK, I'll go work on MatchFieldQuery.

This is done. It will remain a private class for now. I'm not
satisfied with its behavior or its API, but it will work fine as a
cog in the PolyFilter machine.

The PolyFilter API outlined earlier, though, that's fine. PolyFilter
will go public.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_02: RangeFilter problems [ In reply to ]
At 0:31 -0700 2007.03.28, Marvin Humphrey wrote:
>So I don't think we need to test to make sure that distinct
>QueryFilters return distinct BitVector objects.

This is just to make sure that caching is working right, that it is not
being stored somehow in a way that we will get back the wrong BitVector
object.


>The only danger would be that two different QueryFilters constructed
>with different queries give you the wrong cached bits -- so you
>filter out "bar" when you meant to filter out "foo". That would be
>bad, but it's also far-fetched.

With the current implementation, yeah, but if caching method changes ... I
don't mind removing it, I just figured it didn't hurt anything, and I'd
rather test than not test if it's not a big deal.


>I think we might consider
>simplifying the tests so that there are fewer dimensions and the test
>file becomes a little easier to follow.

Yeah, it is a little complicated to follow. :-)

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_02: RangeFilter problems [ In reply to ]
Oh, and other than this test, what else can I do? I have the patch for
RangeFilter and MatchFieldQuery, though not sure if it can be tested until
PolyFilter is ready:

Index: lib/KinoSearch/Search/RangeFilter.pm
===================================================================
--- lib/KinoSearch/Search/RangeFilter.pm (revision 2229)
+++ lib/KinoSearch/Search/RangeFilter.pm (working copy)
@@ -33,12 +33,11 @@
# collect docs that have a value for this field which passes the filter
my $collector = KinoSearch::Search::HitCollector->new_bit_coll;
my $searcher = KinoSearch::Searcher->new( reader => $reader );
-# XXX not working yet
-# my $query = KinoSearch::Search::MatchFieldQuery->new(
-# field => $self->{field} );
+ my $query = KinoSearch::Search::MatchFieldQuery->new(
+ field => $self->{field} );

$searcher->collect(
-# query => $query,
+ query => $query,
filter => $self,
collector => $collector,
);

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_02: RangeFilter problems [ In reply to ]
On Mar 29, 2007, at 10:50 AM, Chris Nandor wrote:

>> The only danger would be that two different QueryFilters constructed
>> with different queries give you the wrong cached bits -- so you
>> filter out "bar" when you meant to filter out "foo". That would be
>> bad, but it's also far-fetched.
>
> With the current implementation, yeah, but if caching method
> changes ...

Fair enough.

> I don't mind removing it, I just figured it didn't hurt anything, and

It does hurt something, though: testing this one aspect, at least in
this way, substantially increases the difficulty of maintaining the
test file.

Some tests are less useful than others, and they all have to justify
the costs of writing and maintaining them. Being a utilitarian by
disposition, I tend to see chasing complete test coverage down to the
last "or die" as an absurd, fetishistic waste of resources -- but
also recognize that reasonably thorough test suites pay surprisingly
large dividends in the long run.

I agree that it's better to test this aspect than not to; I would
just like to hold down the maintenance costs. I may go in and
refactor a bit, in an attempt to uphold the spirit of your test while
not sacrificing so much clarity.

> I'd rather test than not test if it's not a big deal.

OK, you make a good case, and I've committed the file as revision
2230. Thanks very much for the contribution. :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_02: RangeFilter problems [ In reply to ]
On Mar 29, 2007, at 11:49 AM, Chris Nandor wrote:

> Oh, and other than this test, what else can I do?

We can, and should, test RangeFilter->bits in isolation. BitVector
has a handy to_arrayref() method which was put there for just such a
situation -- it returns one number for each set bit, so you can
verify that the expected bits have been set.

> I have the patch for
> RangeFilter and MatchFieldQuery, though not sure if it can be
> tested until
> PolyFilter is ready:

I'll commit this patch when we have a test for it.

As for PolyFilter, would you like to try writing it? It will be a
pure-perl class. I have a lot of stuff to do that nobody else can
handle easily; PolyFilter isn't like that. Right now, I'm deep
inside the guts of TermScorer.

If you tackle PolyFilter, I'll prioritize BitVector->AND, BitVector-
>OR, and BitVector->XOR.

Following Damian's "design by coding" mantra, here's how using
PolyFilter should work:

my $polyfilter = KinoSearch::Search::PolyFilter->new;
$polyfilter->add( filter => $query_filter, logic => 'AND' );
$polyfilter->add( filter => $range_filter, logic => 'AND' );
my $hits = $searcher->search( query => $query, filter =>
$polyfilter );

Internally, it will need a bits() method, which uses a caching
mechanism identical to that used by QueryFilter.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_02: RangeFilter problems [ In reply to ]
in QueryFilter::bits():

$searcher->collect(
query => $self->{query},
collector => $collector,
);

Should that also pass filter => $self ?

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_02: RangeFilter problems [ In reply to ]
At 12:11 -0700 2007.03.29, Marvin Humphrey wrote:
>We can, and should, test RangeFilter->bits in isolation. BitVector
>has a handy to_arrayref() method which was put there for just such a
>situation -- it returns one number for each set bit, so you can
>verify that the expected bits have been set.

I've just about got these tests added to 512-range_filter.t.


>As for PolyFilter, would you like to try writing it?

I think so. And maybe at that point, I'll start a new subject line!

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_02: RangeFilter problems [ In reply to ]
At 12:57 -0700 2007.03.27, Marvin Humphrey wrote:
>> Does RangeFilter's to-be-new bits() method need the same caching?
>
>There's actually some caching going on with regards to RangeFilter,
>but it's already stored in the IndexReader. :)

Looking into this, though, it looks like the BitVector value in RangeFilter
still needs to be cached. You gave me this code in the initial e-mail:

sub bits {
my ( $self, $reader ) = @_;

# collect docs that have a value for this field which passes the
filter
my $collector = KinoSearch::Search::HitCollector->new_bit_coll;
my $searcher = KinoSearch::Searcher->new( reader => $reader );
my $query = KinoSearch::Search::MatchFieldQuery->new(
field => $self->{field},
);
$searcher->collect(
query => $query,
filter => $self,
collector => $collector,
);

return $collector->get_bit_vector;
}

But when I call new_bit_coll without a value, I get an error. It wants a
BitVector value. Even if it didn't and created one on its own somehow,
wouldn't it be more efficient to cache it?

Was I reading too much into what you wrote above? I assumed you meant
bits() didn't need to be cached.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_02: RangeFilter problems [ In reply to ]
On Mar 29, 2007, at 2:28 PM, Chris Nandor wrote:

> in QueryFilter::bits():
>
> $searcher->collect(
> query => $self->{query},
> collector => $collector,
> );
>
> Should that also pass filter => $self ?

No, it shouldn't.

* Every hit that matches the query will be collected by $collector
* Every hit that collector collects will cause a bit to be set
* At the end, you have a bit vector with a set bit for every hit

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_02: RangeFilter problems [ In reply to ]
On Mar 29, 2007, at 2:35 PM, Chris Nandor wrote:

> But when I call new_bit_coll without a value, I get an error. It
> wants a
> BitVector value.

My untested code contained an error. We'll need to create a
BitVector. It should have a capacity of Reader->max_doc.

my $bit_vec = KinoSearch::Util::BitVector->new(
capacity => $reader->max_doc );
my $collector = KinoSearch::Search::HitCollector->new_bit_coll(
bit_vector => $bit_vec );

> Even if it didn't and created one on its own somehow,
> wouldn't it be more efficient to cache it?

You're right, it would be more efficient -- but that efficiency
wouldn't be exploited very often. In general, the RangeFilter
operates without it. If you do this...

my $hits = $searcher->search(
query => 'foo',
filter => $range_filter,
);

... KS doesn't call RangeFilter->bits.

As of now, RangeFilter->bits will only be called internally by
PolyFilter. PolyFilter will do its own caching, so any one
PolyFilter won't need to keep calling RangeFilter->bits.

There would be added efficiency if you created several PolyFilters
using the same RangeFilter, but that seems esoteric.

> Was I reading too much into what you wrote above? I assumed you meant
> bits() didn't need to be cached.

That's what I meant. It's a judgement call. I think it's better not
to consume memory. Especially since the call to RangeFilter->bits is
going to be blazing fast -- it uses an existing sort_cache and
there's no disk acess needed.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_02: RangeFilter problems [ In reply to ]
At 14:58 -0700 2007.03.29, Marvin Humphrey wrote:
>My untested code contained an error. We'll need to create a
>BitVector. It should have a capacity of Reader->max_doc.

Yeah, that's what I had. I just wasn't sure a. if that was intended (I
figured it was, else new_bit_coll could not know capacity, if it were to
create a BitVector on its own), and b. if you wanted caching.


>> Was I reading too much into what you wrote above? I assumed you meant
>> bits() didn't need to be cached.
>
>That's what I meant. It's a judgement call. I think it's better not
>to consume memory. Especially since the call to RangeFilter->bits is
>going to be blazing fast -- it uses an existing sort_cache and
>there's no disk acess needed.

OK. Patches included for RangeFilter.pm and t/512 (and one housekeeping
patch for QueryFilter).

And work has begun on PolyFilter ...

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-20070329.txt
Type: application/octet-stream
Size: 6323 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070329/125561c1/patch-20070329.obj
_02: RangeFilter problems [ In reply to ]
On Mar 29, 2007, at 2:35 PM, Chris Nandor wrote:

> At 12:11 -0700 2007.03.29, Marvin Humphrey wrote:
>> We can, and should, test RangeFilter->bits in isolation. BitVector
>> has a handy to_arrayref() method which was put there for just such a
>> situation -- it returns one number for each set bit, so you can
>> verify that the expected bits have been set.
>
> I've just about got these tests added to 512-range_filter.t.

Fab.

>> As for PolyFilter, would you like to try writing it?
>
> I think so. And maybe at that point, I'll start a new subject line!

Double fab. :)

We should consider creating an abstract Filter class which
RangeFilter, QueryFilter, and PolyFilter would inherit from.

I also think it's time the __PACKAGE__->init_instance_vars stuff got
the heave-ho. I anticipate that you'll find it confusing. It has
turned out to be less useful and more annoying than I'd envisioned.

Just start PolyFilter off with this...

use strict;
use warnings;

package KinoSearch::Search::PolyFilter;
use KinoSearch::Util::ToolSet;
use base qw( KinoSearch::Util::Class );

our %instance_vars = ();

... and PolyFilter can inherit new() from KinoSearch::Util::Class.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_02: RangeFilter problems [ In reply to ]
On Mar 29, 2007, at 3:07 PM, Chris Nandor wrote:

> Patches included for RangeFilter.pm and t/512

This looks great.

First I had to fix a memory glitch in MatchFieldScorer.c that your
patch revealed. You'll need to svn up.

After that, I applied your changes as revision 2232. Thanks!

> (and one housekeeping patch for QueryFilter).

This cracked me up. :)

Index: lib/KinoSearch/Search/QueryFilter.pm
===================================================================
--- lib/KinoSearch/Search/QueryFilter.pm (revision 2230)
+++ lib/KinoSearch/Search/QueryFilter.pm (working copy)
@@ -9,8 +9,6 @@
__PACKAGE__->init_instance_vars(
# constructor params / members
query => undef,
- # members
- cached_bits => undef,
);
}

I was actually going to make some minor changes to your
implementation, but I hadn't wanted to nitpick. ;)

The %instance_vars hash in each class serves two purposes.

First, labeled constructor args are checked against its members, so
that misspelled argument labels cause errors rather than obscure bugs.

Second, it serves as a template of default values from which hash-
based objects get cloned.

Since this has come up though, I've applied the patch below instead,
which is more consistent in relation to the rest of KS. (Except for
the fact that it doesn't use __PACKAGE__->init_instance_vars, but I'm
about to kill that off globally.)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


$ svn diff ..
Index: ../perl/lib/KinoSearch/Search/QueryFilter.pm
===================================================================
--- ../perl/lib/KinoSearch/Search/QueryFilter.pm (revision 2230)
+++ ../perl/lib/KinoSearch/Search/QueryFilter.pm (working copy)
@@ -5,15 +5,14 @@
use KinoSearch::Util::ToolSet;
use base qw( KinoSearch::Util::Class );

-BEGIN {
- __PACKAGE__->init_instance_vars(
- # constructor params / members
- query => undef,
- # members
- cached_bits => undef,
- );
-}
+our %instance_vars = (
+ # constructor params / members
+ query => undef,

+ # members
+ cached_bits => {},
+);
+
use KinoSearch::Search::HitCollector;
use KinoSearch::Util::BitVector;

@@ -26,11 +25,11 @@
sub bits {
my ( $self, $reader ) = @_;

- my $cached_bits = $self->{$reader}{cached_bits};
+ my $cached_bits = $self->{cached_bits}{$reader};

# fill the cache
if ( !defined $cached_bits ) {
- $self->{$reader}{cached_bits} = $cached_bits =
KinoSearch::Util::BitVector->new(
+ $self->{cached_bits}{$reader} = $cached_bits =
KinoSearch::Util::BitVector->new(
capacity => $reader->max_doc );

my $collector = KinoSearch::Search::HitCollector-
>new_bit_coll(
_02: RangeFilter problems [ In reply to ]
At 17:43 -0700 2007.03.29, Marvin Humphrey wrote:
>+our %instance_vars = (
>+ # constructor params / members
>+ query => undef,
>
>+ # members
>+ cached_bits => {},
>+);
>+

>- my $cached_bits = $self->{$reader}{cached_bits};
>+ my $cached_bits = $self->{cached_bits}{$reader};

Gotcha.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_02: RangeFilter problems [ In reply to ]
At 14:58 -0700 2007.03.29, Marvin Humphrey wrote:
>> Even if it didn't and created one on its own somehow,
>> wouldn't it be more efficient to cache it?
>
>You're right, it would be more efficient -- but that efficiency
>wouldn't be exploited very often. In general, the RangeFilter
>operates without it. If you do this...
>
> my $hits = $searcher->search(
> query => 'foo',
> filter => $range_filter,
> );
>
>... KS doesn't call RangeFilter->bits.
>
>As of now, RangeFilter->bits will only be called internally by
>PolyFilter. PolyFilter will do its own caching, so any one
>PolyFilter won't need to keep calling RangeFilter->bits.
>
>There would be added efficiency if you created several PolyFilters
>using the same RangeFilter, but that seems esoteric.

I was just thinking about this again, and another time RangeFilter->bits
gets called again is for the same PolyFilter, if you add more filters to
it; then it has to call bits() on each filter again.

Maybe that's similarly esoteric, but I figured I'd mention it.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
_02: RangeFilter problems [ In reply to ]
On Apr 3, 2007, at 2:42 PM, Chris Nandor wrote:

> I was just thinking about this again, and another time RangeFilter-
> >bits
> gets called again is for the same PolyFilter, if you add more
> filters to
> it; then it has to call bits() on each filter again.

You're right, but I still don't think that warrants taking the memory
hit. I'm pretty sure RangeFilter->bits will be fast enough in context.

If someone comes up with a scenario where it makes a difference, we
can either change our minds and add it, or add a CachingWrapperFilter
class.

1 2  View All