On Mar 20, 2007, at 11:55 AM, Chris Nandor wrote:
> At 9:39 -0700 2007.03.15, Marvin Humphrey wrote:
>> On Mar 8, 2007, at 2:36 PM, Chris Nandor wrote:
>>> how do I combine my $range_filter with other filters? Is it
>>> possible?
>>
>> Not presently. I've been contemplating how to make this available
>> (i.e. procrastinating) while working on a bunch of other problems.
>> The trick is how ranges should score.
>
> This is something we need pretty soon; is there anything I can do
> to help make it work?

Yes, there is.

QueryFilter needs to be changed to cache BitVector objects in a hash,
keyed per IndexReader. The bits() method should be changed to take
an IndexReader rather than a Searcher as an argument, and so should
make_collector(). Calls to those methods in the library and the test
suite need to be adjusted.
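
A rough sketch of the per-IndexReader caching idea, in case it helps. Everything here is a stand-in: My::CachingFilter, My::Reader and the 'compute' callback are hypothetical names, not the real KinoSearch classes, and keying the hash on refaddr() is just one assumption about how "keyed per IndexReader" could be done.

```perl
use strict;
use warnings;
use Scalar::Util qw( refaddr );

# Hypothetical stand-in for QueryFilter; not the real KinoSearch class.
package My::CachingFilter;

sub new {
    my ( $class, %args ) = @_;
    # 'compute' is a hypothetical callback that builds the bits for a reader.
    return bless( { compute => $args{compute}, cache => {} }, $class );
}

sub bits {
    my ( $self, $reader ) = @_;
    # Assumption: the reader's address is used as the hash key.
    my $key = Scalar::Util::refaddr($reader);
    $self->{cache}{$key} //= $self->{compute}->($reader);
    return $self->{cache}{$key};
}

package main;

my $builds = 0;
my $filter = My::CachingFilter->new(
    compute => sub { $builds++; return { owner => $_[0] } },
);
my $reader_a = bless( {}, 'My::Reader' );
my $reader_b = bless( {}, 'My::Reader' );

my $first  = $filter->bits($reader_a);    # built
my $second = $filter->bits($reader_a);    # served from the cache
my $other  = $filter->bits($reader_b);    # built for the other reader
```

One caveat with this approach: an address can be reused once a reader has been destroyed, so a cache like this needs stale entries cleared out somehow.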

Tests need to be added to t/507-query_filter.t to ensure that...

  * The caching mechanism works and we don't keep generating new
    BitVectors.
  * The correct BitVector is returned by the bits() method (i.e. not
    one belonging to another IndexReader).

Ideally, destruction of the cached BitVectors held by a QueryFilter
object would be triggered when the IndexReader gets destroyed, since
they're no longer of any use after that. That's a little harder, and
may require some sort of stupid hack to store references to the
BitVectors in IndexReader along with calling weaken() on the refs
held by the QueryFilter object. The point is that we don't want to
accumulate BitVectors when the Searcher/Reader is being continually
refreshed.
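
Here is a minimal sketch of that weaken() arrangement, under stated assumptions: My::Reader, its adopt() method, My::WeakCachingFilter and prune() are all hypothetical, and a plain hash stands in for a BitVector. The reader holds the only long-lived strong reference, so the filter's weakened copy goes undef when the reader is destroyed.

```perl
use strict;
use warnings;
use Scalar::Util qw( refaddr weaken );

# Hypothetical stand-in for IndexReader.
package My::Reader;

sub new {
    my ($class) = @_;
    return bless( { owned => [] }, $class );
}

# Hypothetical hook: the reader keeps the strong reference.
sub adopt {
    my ( $self, $obj ) = @_;
    push @{ $self->{owned} }, $obj;
    return;
}

# Hypothetical stand-in for QueryFilter.
package My::WeakCachingFilter;

sub new {
    my ($class) = @_;
    return bless( { cache => {} }, $class );
}

sub bits {
    my ( $self, $reader ) = @_;
    my $key  = Scalar::Util::refaddr($reader);
    my $bits = $self->{cache}{$key};
    return $bits if defined $bits;

    $bits = { bits => '' };    # placeholder for a real BitVector
    $reader->adopt($bits);
    $self->{cache}{$key} = $bits;
    Scalar::Util::weaken( $self->{cache}{$key} );
    return $bits;
}

# Drop hash slots whose weak reference has been cleared;
# returns the number of entries still cached.
sub prune {
    my ($self) = @_;
    my $cache = $self->{cache};
    delete @{$cache}{ grep { !defined $cache->{$_} } keys %$cache };
    return scalar keys %$cache;
}

package main;

my $filter = My::WeakCachingFilter->new;
my $reader = My::Reader->new;
my $bits   = $filter->bits($reader);
my $cached_before = $filter->prune;    # entry is alive
undef $bits;
undef $reader;                         # reader and its bits are freed
my $cached_after = $filter->prune;     # weak ref was cleared
```

Because a cleared slot reads as undef, bits() rebuilds rather than returning stale data even if a new reader happens to land at a reused address.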

RangeFilter also needs make_collector() changed to be keyed off of an
IndexReader. That will be straightforward, as the first thing
RangeFilter->make_collector does right now is call get_reader().
Tests and library calls to the method need to be adjusted, but won't
need any changes to their substance.

RangeFilter then needs a bits() method added to it. It will probably
look like this...

    sub bits {
        my ( $self, $reader ) = @_;

        # collect docs that have a value for this field which passes
        # the filter
        my $collector = KinoSearch::Search::HitCollector->new_bit_coll;
        my $searcher  = KinoSearch::Searcher->new( reader => $reader );
        my $query     = KinoSearch::Search::MatchFieldQuery->new(
            field => $self->{field},
        );
        $searcher->collect(
            query     => $query,
            filter    => $self,
            collector => $collector,
        );

        return $collector->get_bit_vector;
    }

Searcher->collect needs to be created, but that will basically be a
refactoring of Searcher->search_hit_collector which will be trivial
for me and hard for anyone else... so I'll handle that.

MatchFieldQuery (which will be nearly identical to TermQuery) also
needs to be written. Writing tests to ensure that a Searcher returns
correct results when supplied with a MatchFieldQuery will be pretty
straightforward and would be appreciated.

I'd love it if someone else wanted to get involved in writing
MatchFieldQuery itself, but such a person would need to be willing
to absorb some information retrieval theory -- so I'll assume it will
be my sole responsibility (as will MatchFieldScorer) unless someone
expresses an interest.

Finally, we need to create PolyFilter. PolyFilter will have an add()
method which works like this:

    $poly_filter->add(
        filter => $filter,
        logic  => 'AND',
    );

PolyFilter->bits() will call bits() on each of its sub-filters, then
it will combine the BitVectors together. Like QueryFilter, it will
cache filters per-IndexReader.
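
To make the combining step concrete, here is a small sketch. It is not the planned implementation: My::PolyFilter and My::FixedFilter are hypothetical stand-ins, bit vectors are modelled as equal-length Perl bit strings built with vec(), and folding the clauses left to right (with the first clause simply seeding the result) is my assumption about how 'logic' would chain. The per-IndexReader caching is left out to keep it short.

```perl
use strict;
use warnings;

# Hypothetical sub-filter with precomputed bits for 16 documents.
package My::FixedFilter;

sub new {
    my ( $class, @docs ) = @_;
    my $vector = "\0" x 2;
    vec( $vector, $_, 1 ) = 1 for @docs;
    return bless( { vector => $vector }, $class );
}

sub bits { return $_[0]{vector} }

# Hypothetical stand-in for the proposed PolyFilter.
package My::PolyFilter;

sub new {
    my ($class) = @_;
    return bless( { clauses => [] }, $class );
}

sub add {
    my ( $self, %args ) = @_;
    push @{ $self->{clauses} },
        { filter => $args{filter}, logic => $args{logic} };
    return;
}

sub bits {
    my ( $self, $reader ) = @_;
    my $result;
    for my $clause ( @{ $self->{clauses} } ) {
        my $bits = $clause->{filter}->bits($reader);
        # Assumption: the first clause seeds the result, whatever its logic.
        if    ( !defined $result )          { $result = $bits }
        elsif ( $clause->{logic} eq 'AND' ) { $result = $result & $bits }
        elsif ( $clause->{logic} eq 'OR' )  { $result = $result | $bits }
        else  { die "unsupported logic: $clause->{logic}" }
    }
    return $result;
}

package main;

my $poly = My::PolyFilter->new;
$poly->add( filter => My::FixedFilter->new( 1, 2, 3 ), logic => 'AND' );
$poly->add( filter => My::FixedFilter->new( 2, 3, 4 ), logic => 'AND' );
$poly->add( filter => My::FixedFilter->new(9),         logic => 'OR' );

# The stand-in filters ignore the reader argument.
my $combined = $poly->bits(undef);
my @matches  = grep { vec( $combined, $_, 1 ) } 0 .. 15;    # 2, 3, 9
```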

At present, BitVector only has a logical_and() method; if PolyFilter
is to be able to combine filters using OR, XOR, etc, the appropriate
methods need to be added to BitVector. This is deceptively
difficult. It involves classic C bit-twiddling, but has to be
maximally efficient, and there are a lot of nasty corner cases that
need tests. I'm assuming I'll be handling this one.
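
As an illustration of one such corner case (operands of different lengths), here is how Perl's own string bitwise operators behave. This is only a model of the semantics, not the efficient C code BitVector would need; make_bits() and set_bits() are hypothetical helpers.

```perl
use strict;
use warnings;

# Build a bit string with the given doc numbers set (hypothetical helper).
sub make_bits {
    my @docs = @_;
    my $bits = '';
    vec( $bits, $_, 1 ) = 1 for @docs;
    return $bits;
}

# List the set bits in a bit string (hypothetical helper).
sub set_bits {
    my ($bits) = @_;
    return grep { vec( $bits, $_, 1 ) } 0 .. 8 * length($bits) - 1;
}

my $short = make_bits( 1, 3 );     # 1 byte long
my $long  = make_bits( 3, 20 );    # 3 bytes long

# | and ^ treat the shorter operand as zero-padded;
# & truncates the result to the shorter operand.
my $and_vec = $short & $long;
my @or      = set_bits( $short | $long );    # 1, 3, 20
my @xor     = set_bits( $short ^ $long );    # 1, 20
my @and     = set_bits($and_vec);            # 3
my $and_len = length($and_vec);              # 1
```

A C implementation has to make the same decisions explicitly: what to do with the tail of the longer vector for each operation, and how to treat unused bits in the final byte.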

Still with me? ;)

I also ask that potential hackers agree to contribute their code to
Apache. That way we can use it in Lucy without complication.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/