Mailing List Archive: PolyFilter and Plans

PolyFilter and Plans

projects at pudge

Apr 2, 2007, 8:32 AM

Post #1 of 22 (5640 views)

PolyFilter looks pretty close to done. Do you need anything else for it?

And what remains to be done until the development release is
alpha/beta/release?

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

marvin at rectangular

Apr 2, 2007, 9:32 AM

Post #2 of 22 (5572 views)

On Apr 2, 2007, at 8:23 AM, Chris Nandor wrote:

> PolyFilter looks pretty close to done. Do you need anything else
> for it?

Reply forthcoming...

> And what remains to be done until the development release is
> alpha/beta/release?

Here's the stuff still on the to-do list for 0.20_03

Mandatory:

* Fix Henry's bug.
* Implement Schema->sort_field, which will make it possible to
order document numbers within a segment according to the sort
order of a particular field.
* Document Searcher->set_prune_factor, once the previous item
is done. See <http://issues.apache.org/jira/browse/LUCENE-851>
for an overview of what's in the works.
* Finish porting Lucene's BooleanScorer2.
* Clean up the memory errors and leaks now occurring in Tally.
(Finishing the BooleanScorer2 port is a large part of this).
* Devel release prep:
o Run test suite under valgrind and fix any problems that
turn up.
o PerlTidy the Perl source code.
o Test the tarball on FreeBSD and Linux in addition to my
Mac OS X laptop.

Optional:

* Fix a problem with BitVector.
* Add a small optimization to Tokenizer and LCNormalizer to avoid
extra string copies.
* Change Highlighter API to allow multiple fields to follow
different specs. Basically, you'd call Highlighter->add_field
the way you call PolyFilter->add. (RT issue 25400.)
* Hide a few more private classes from search.cpan.org. The
META.yml exclusion list works, but search.cpan.org isn't
recursively excluding the contents of forbidden directories.
(This is probably a bug I should report to the powers that be
at search.cpan.org.)

Important, but won't make the cut:

* Fix memory leaks in PolyFilter and QueryFilter.
* KinoSearch::Simple.
* Set up a wiki on rectangular.com where I can put lists like
this one.

Pruning is mandatory because A) it's required by my main client and
B) in the course of implementing it, I may discover that we need an
index format change (though I don't think so now). Tally is
mandatory because it's hard to back out. The optional items are all
either easy or 90% done.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

PolyFilter and Plans [ In reply to ]

mwexler at tias

Apr 2, 2007, 10:08 AM

Post #3 of 22 (5578 views)

Marvin Humphrey wrote:
>
> On Apr 2, 2007, at 8:23 AM, Chris Nandor wrote:
>
>> PolyFilter looks pretty close to done. Do you need anything else for
>> it?
>
> Reply forthcoming...
>
>> And what remains to be done until the development release is
>> alpha/beta/release?
>
> Here's the stuff still on the to-do list for 0.20_03
>
> Mandatory:
>
> * Fix Henry's bug.
> * Implement Schema->sort_field, which will make it possible to
> order document numbers within a segment according to the sort
> order of a particular field.
> * Document Searcher->set_prune_factor, once the previous item
> is done. See <http://issues.apache.org/jira/browse/LUCENE-851>
> for an overview of what's in the works.
> * Finish porting Lucene's BooleanScorer2.
> * Clean up the memory errors and leaks now occurring in Tally.
> (Finishing the BooleanScorer2 port is a large part of this).
> * Devel release prep:
> o Run test suite under valgrind and fix any problems that
> turn up.
> o PerlTidy the Perl source code.
> o Test the tarball on FreeBSD and Linux in addition to my
> Mac OS X laptop.
>
> Optional:
>
> * Fix a problem with BitVector.
> * Add a small optimization to Tokenizer and LCNormalizer to avoid
> extra string copies.
> * Change Highlighter API to allow multiple fields to follow
> different specs. Basically, you'd call Highlighter->add_field
> the way you call PolyFilter->add. (RT issue 25400.)
> * Hide a few more private classes from search.cpan.org. The
> META.yml exclusion list works, but search.cpan.org isn't
> recursively excluding the contents of forbidden directories.
> (This is probably a bug I should report to the powers that be
> at search.cpan.org.)
>
> Important, but won't make the cut:
>
> * Fix memory leaks in PolyFilter and QueryFilter.
> * KinoSearch::Simple.
> * Set up a wiki on rectangular.com where I can put lists like
> this one.
>
> Pruning is mandatory because A) it's required by my main client and B)
> in the course of implementing it, I may discover that we need an index
> format change (though I don't think so now). Tally is mandatory
> because it's hard to back out. The optional items are all either easy
> or 90% done.
What is this pruning. I don't remember previously seeing it on the list
(I'm not complaining, just curious).

>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
> _______________________________________________
> KinoSearch mailing list
> KinoSearch@rectangular.com
> http://www.rectangular.com/mailman/listinfo/kinosearch

PolyFilter and Plans [ In reply to ]

marvin at rectangular

Apr 2, 2007, 3:32 PM

Post #4 of 22 (5575 views)

Reply #2...

On Apr 2, 2007, at 8:23 AM, Chris Nandor wrote:

> PolyFilter looks pretty close to done. Do you need anything else
> for it?

The one thing remaining is that we need is to prevent the
accumulation of cached bit vectors as index readers get refreshed.
That's gonna be a real nasty memory leak if you're running in a
persistent environment. I'm willing to release a devel version like
that, but it definitely has to be fixed before 0.20 proper.

How would it be if we added a $reader->{refs} = [] element to
IndexReader, along with a Reader->add_ref method? Then we store a
real reference in IndexReader, and call weaken() on the one in
$filter->{cached_bits}.

We could also eliminate $filter->{cached_bits} altogether and store
the bits in the IndexReader itself, using a hash instead of an
array. I think you suggested something along those lines. We can
handle that like so...

# QueryFilter.pm, PolyFilter.pm
$reader->store_ref( $filter, $bits );
my $bits = $reader->fetch_ref($filter);

# IndexReader.pm
sub store_ref {
my ( $self, $key, $val ) = @_;
$self->{refs}{$key} = $val;
}
sub fetch_ref {
my ( $self, $key ) = @_;
return $self->{refs}{$key};
}

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 2, 2007, 3:44 PM

Post #5 of 22 (5580 views)

At 15:23 -0700 2007.04.02, Marvin Humphrey wrote:
>The one thing remaining is that we need is to prevent the
>accumulation of cached bit vectors as index readers get refreshed.

Why won't they get destroyed automatically when the references to them
disappear? I am sure I am missing something obvious.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

marvin at rectangular

Apr 2, 2007, 4:00 PM

Post #6 of 22 (5582 views)

On Apr 2, 2007, at 3:35 PM, Chris Nandor wrote:

> Why won't they get destroyed automatically when the references to them
> disappear?

Here's pseudocode illustrating the leak:

my $query_filter = QueryFilter->new($query);

while (1) {
my $searcher = refresh_searcher();
$searcher->search(
query => 'foo',
filter => $query_filter,
);
}

On every loop iter, a new bitvector will end up in $query_filter->
{cached_bits}. None of them will be destroyed until $query_filter
gets destroyed.

You can defeat that in my example my moving the QueryFilter
constructor inside the loop, but aren't there going to be scenarios
under mod_perl/FastCGI where the QueryFilter object gets reused under
a theoretically endless series of periodically refreshed searchers?
You don't want to recreate the QueryFilter object with each search --
you wouldn't get any caching benefits.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 2, 2007, 4:14 PM

Post #7 of 22 (5579 views)

At 15:51 -0700 2007.04.02, Marvin Humphrey wrote:
>On every loop iter, a new bitvector will end up in $query_filter->
>{cached_bits}. None of them will be destroyed until $query_filter
>gets destroyed.

Oh, right ... I was for some reason thinking the reader was caching the
bitvector. Duh. I think we had talked about that at one point, but for
whatever reason, in my mind I was thinking that when the reader went away,
it would take the bitvectors with it. Obviously not the case.

I like your second suggestion; it probably *should* be in the reader, so we
don't have to manually handle destruction.

But while we are on the subject, I don't really like using objects as hash
keys, because I don't know that it's necessarily reliably unique. Over
time in a persistent, we could see the same object name/address come up
again, no?

So I used $self->{cached_bits}{$reader}, because I couldn't think of a
simple alternative, and your suggested code does something similar with
$self->{refs}{$key}, but I wonder if you have any ideas for a unique
identifier, unless you think the stringified object name is unique enough.
Maybe even a simple global counter, I dunno.

Regardless, I like the suggestion, and I'll move forward with it, if you like.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

marvin at rectangular

Apr 2, 2007, 6:16 PM

Post #8 of 22 (5561 views)

On Apr 2, 2007, at 4:06 PM, Chris Nandor wrote:

> But while we are on the subject, I don't really like using objects
> as hash
> keys, because I don't know that it's necessarily reliably unique.
> Over
> time in a persistent, we could see the same object name/address
> come up
> again, no?

Yeah. It's actually quite likely. We'll need to fix that.

> So I used $self->{cached_bits}{$reader}, because I couldn't think of a
> simple alternative, and your suggested code does something similar
> with
> $self->{refs}{$key}, but I wonder if you have any ideas for a unique
> identifier, unless you think the stringified object name is unique
> enough.

To do this right, we'd ape Lucene and create hash_code() methods for
each Query subclass and each component that goes into each of them.
The idea is that objects which are "equal" return the same 32-bit
integer value for hash_code. We can thus key the cache storage in
IndexReader off of QueryFilter->hash_code and PolyFilter->hash_code.

hash_code() is already implemented for ByteBuf, KinoSearch's main
string class, using the exact hashing algorithm that Perl uses for
strings. Because of that, the Lucene implementation for Term.hashCode
()...

/** Combines the hashCode() of the field and the text. */
public final int hashCode() {
return field.hashCode() + text.hashCode();
}

... is trivial to port to KinoSearch C code:

i32_t
Term_hash_code(Term *self)
{
return BB_Hash_Code(self->field) + BB_Hash_Code(self->text);
}

We don't even need an XS hook, because the way I designed the
KinoSearch's OO system for C classes, they all inherit from Obj, and
the XS hook for Obj->hash_code will invoke the correct routine
polymorphically.

However, we do need tests. :)

> Maybe even a simple global counter, I dunno.

I'd consider that an acceptable band-aid, though I don't think it's
that hard to do things right, and adding hash_code for all Query
classes has other benefits, like making it easier to test
QueryParser. I've pasted a compendium of all the routines that need
porting below my sig.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

---------------------------------------------------------------

TermQuery:

/** Returns a hash code value for this object.*/
public int hashCode() {
return Float.floatToIntBits(getBoost()) ^ term.hashCode();
}

PhraseQuery:

/** Returns a hash code value for this object.*/
public int hashCode() {
return Float.floatToIntBits(getBoost())
^ slop
^ terms.hashCode()
^ positions.hashCode();
}

BooleanQuery:

/** Returns a hash code value for this object.*/
public int hashCode() {
return Float.floatToIntBits(getBoost()) ^ clauses.hashCode()
+ getMinimumNumberShouldMatch();
}

BooleanClause:

public int hashCode() {
return query.hashCode()
^ (Occur.MUST.equals(occur)?1:0)
^ (Occur.MUST_NOT.equals(occur)?2:0);
}

QueryFilter:

public int hashCode() {
return query.hashCode() ^ 0x923F64B9;
}

RangeFilter:

/** Returns a hash code value for this object.*/
public int hashCode() {
int h = fieldName.hashCode();
h ^= lowerTerm != null ? lowerTerm.hashCode() : 0xB6ECE882;
h = (h << 1) | (h >>> 31); // rotate to distinguish lower
from upper
h ^= (upperTerm != null ? (upperTerm.hashCode()) : 0x91BEC2C2);
h ^= (includeLower ? 0xD484B933 : 0)
^ (includeUpper ? 0x6AE423AC : 0);
return h;
}
}

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 2, 2007, 6:31 PM

Post #9 of 22 (5572 views)

Cool, I'll look into it.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 2, 2007, 8:01 PM

Post #10 of 22 (5581 views)

At 18:07 -0700 2007.04.02, Marvin Humphrey wrote:
Where should the code for all these go? I added Term_hash_code to Term.c
(did you want that added, or was it just an example?) and that seemed to
work fine. But there's no corresponding C code bits for the other classes.

Also, I wonder what PolyFilter's hash_code implementation should be.

And thinking more about this ... are the hash_code values meant to be
unique only for a given class? Could a RangeFilter and QueryFilter have
the same value?

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

marvin at rectangular

Apr 2, 2007, 9:26 PM

Post #11 of 22 (5577 views)

On Apr 2, 2007, at 7:52 PM, Chris Nandor wrote:

> At 18:07 -0700 2007.04.02, Marvin Humphrey wrote:
> Where should the code for all these go?

They should go in the .pm files. If the Query subclasses were all C
classes, we'd implement hash_code in C, but as it is we'll implement
them in Perl.

For background, KinoSearch::Util::Obj is loosely based on
java.lang.Object.

[ Note: read this whole email before following the links ]

http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Object.html

Obj_Hash_Code, Obj_To_String, Obj_Clone, and Obj_Equals are all
roughly similar to their Java analogues. Obj_Destroy is sort of
similar to Object.finalize(), but KinoSearch triggers Obj_Destroy
using reference counting (like Perl), whereas java uses garbage
collection.

> I added Term_hash_code to Term.c
> (did you want that added, or was it just an example?)

We definitely need that.

* To generate TermQuery->hash_code, we need Term->hash_code.
* To generate Term->hash_code, we need ByteBuf->hash_code.
* ByteBuf->hash_code is done.

> Also, I wonder what PolyFilter's hash_code implementation should be.

Yeah, we'll have to make that up.
>
> And thinking more about this ... are the hash_code values meant to be
> unique only for a given class? Could a RangeFilter and QueryFilter
> have
> the same value?

Statistically rare, but inevitable given enough chances, I would
imagine. Hmm. This is a good point.

[ ... Marvin ponders ... ]

And it may drive us back to using weak references.

Crap.

Java's HashTable, HashSet, and HashMap classes are keyed by Objects,
rather than strings like Perl hashes. When you store a key in a
HashTable, it checks the object.hashCode(). If there's a collision,
it checks object.Equals(). If the objects don't test equal, then
this store op won't displace the existing key.

(Excellent but lengthy explanation at
<http://www-128.ibm.com/developerworks/java/library/j-jtp05273.html>.)

If we use hash_code as a key in a Perl hash, we're only fulfilling
half of the contract, because we can't continue on to test Obj_Equals
if we get a collision. All we have is the stringified version of the
Filter's hash_code, not a reference back to the object.

For reference, here's Lucene's TermQuery.equals() method:

public boolean equals(Object o) {
if (!(o instanceof TermQuery))
return false;
TermQuery other = (TermQuery)o;
return (this.getBoost() == other.getBoost())
&& this.term.equals(other.term);
}

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 2, 2007, 9:43 PM

Post #12 of 22 (5577 views)

At 21:17 -0700 2007.04.02, Marvin Humphrey wrote:
>> And thinking more about this ... are the hash_code values meant to be
>> unique only for a given class? Could a RangeFilter and QueryFilter
>> have
>> the same value?
>
>Statistically rare, but inevitable given enough chances, I would
>imagine. Hmm. This is a good point.
>
>[ ... Marvin ponders ... ]
>
>And it may drive us back to using weak references.

I was thinking, we could just join the hash_code to the filter class type,
but on the other hand, we could get the same hash_code for the SAME class
type, too.

I'd still prefer a global counter to weak references, I think. Could be
simple ... each time a filter object is created, it gets an integer from
the counter, incrementing it.

>If we use hash_code as a key in a Perl hash, we're only fulfilling
>half of the contract, because we can't continue on to test Obj_Equals
>if we get a collision. All we have is the stringified version of the
>Filter's hash_code, not a reference back to the object.

This would let us uniquely identify each filter object, although, it
wouldn't tell us if the filter has been modified. Maybe ... a combination
of both?

sub store_filter_bits {
my ( $self, $filter, $bits ) = @_;
$self->{filters}{$filter->unique_id}{$filter->hash_code} = $bits;
}

I'd probably also like to see other existing keys in {$filter->unique_id}
deleted so they do not accumulate (if hash_code has changed for a given
unique_id, then the other hash_code values should be invalid).

Am I missing anything?

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

marvin at rectangular

Apr 2, 2007, 10:29 PM

Post #13 of 22 (5571 views)

On Apr 2, 2007, at 9:34 PM, Chris Nandor wrote:

> I was thinking, we could just join the hash_code to the filter
> class type,
> but on the other hand, we could get the same hash_code for the SAME
> class
> type, too.

Yes, but that would be acceptable. If two QueryFilters use the exact
same Query and are invoked against the exact same IndexReader,
they'll produce identical BitVectors. You might as well just use
one. You might end up with a little extra cache thrash, tho.

> I'd still prefer a global counter to weak references, I think.

Nat a big fan of those, eh?

> Could be simple ... each time a filter object is created, it gets an
> integer from the counter, incrementing it.

It's so stupid, it just might work!

>> If we use hash_code as a key in a Perl hash, we're only fulfilling
>> half of the contract, because we can't continue on to test Obj_Equals
>> if we get a collision. All we have is the stringified version of the
>> Filter's hash_code, not a reference back to the object.
>
> This would let us uniquely identify each filter object, although, it
> wouldn't tell us if the filter has been modified.

True. That's a criticism of the Java hashing model -- it can
misbehave when keys are mutable and equals() returns different
results based on the state of the object.

In fact, this mutability problem is why I implemented
KinoSearch::Util::Hash to be more like Perl hashes -- it owns its own
keys, which are all ByteBufs. (Actually Perl hashes can use shared
strings which Hash can't, but the point is that in both cases the
keys aren't mutable.)

> Maybe ... a combination
> of both?
>
> sub store_filter_bits {
> my ( $self, $filter, $bits ) = @_;
> $self->{filters}{$filter->unique_id}{$filter->hash_code} = $bits;
> }
>
> I'd probably also like to see other existing keys in {$filter-
> >unique_id}
> deleted so they do not accumulate (if hash_code has changed for a
> given
> unique_id, then the other hash_code values should be invalid).
>
> Am I missing anything?

I distrust elaborate, patchy schemes like this. Too easy to spawn
bugs, too hard to test.

I'm game for the counter.

I'd suggest using class methods rather than a globally visible var --
that way we can track usage in a more controlled manner. Sound good?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 2, 2007, 11:53 PM

Post #14 of 22 (5578 views)

At 22:21 -0700 2007.04.02, Marvin Humphrey wrote:
>I distrust elaborate, patchy schemes like this. Too easy to spawn
>bugs, too hard to test.
>
>I'm game for the counter.

Well, the reason I suggest using the counter AND hash_code is that
hash_code will tell us if the filter has been modified, and the counter
won't.

But now (very late at night while my brain is not working so well), I
wonder, which filters can, or are likely to, change? PolyFilters, of
course. Any others?

And hash_code + class sounds like it could do the trick, but then we could
end up with leaks if filters change a lot.

Maybe if the filter changes, we increment its counter ID ... ? I still
worry a bit about accumulating the cached bitvectors in the reader, but I
suppose that is far less of a problem than the other way around ...

I am going to (literally) sleep on this.

>I'd suggest using class methods rather than a globally visible var --
>that way we can track usage in a more controlled manner. Sound good?

Yes.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

henka at cityweb

Apr 3, 2007, 12:01 AM

Post #15 of 22 (5570 views)

On Mon, 2 Apr 2007, Marvin Humphrey wrote:
> Pruning is mandatory because A) it's required by my main client and
> B) in the course of implementing it, I may discover that we need an
> index format change (though I don't think so now). Tally is
> mandatory because it's hard to back out. The optional items are all
> either easy or 90% done.

!Kudos! to Marvin, Chris and everyone else who is contributing - the next
0.20_ release is looking like a fine piece of work!

Regards
Henry

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 3, 2007, 9:19 AM

Post #16 of 22 (5571 views)

OK, so I want to take a step back and lay my thoughts out on this caching
stuff.

We need to cache the BitVector of a Filter on a per-reader basis. That
BitVector should be destroyed when the Reader is.

A Filter generally does not change, but it can, particularly in the case of
a PolyFilter. When the Filter changes, the BitVector should no longer be
used, and, ideally, it should be cleaned up, as it is not likely to be used
again.

That's basically it. Then there's implementation.

Uniquely identifying a particular Filter can be done with Filter class +
hash_code. This holds true even if the Filter is modified, as hash_code
will change. [.NB: we still need a way to generate hash_code for
PolyFilter.]

Caching by Filter class and hash_code does not allow us to destroy old
cached BitVectors if the Filter has changed. One way to perhaps address
this is with a GUID for the Filter; we can check to see if there is an old
BitVector cached for that GUID, and if so, destroy it.

I cannot think right now of any other possible way to destroy old
BitVectors for modified Filters in this implementation, because if you
cannot uniquely identify the Filter, you can't know that its BitVector is
expired, if the hash_code has changed (we can't check on modification of
the Filter, because we may not know the Reader at that point, such as in
PolyFilter->add). We either need to use a GUID, come up with another way,
or accept that we may have a leak here.

We could also use weak references. I generally think weak refs are a hack,
but then again, the above is also turning into a hack. :) We would no
longer need hash_code, or unique identification. The BitVector would be
there in the Filter and we could use it, up until the Reader disappeared,
in which case it would disappear; and we could prevent BitVectors from
accumulating by manually calling some dispose() method on the BitVector
when the Filter is modified:

# if Filter is modified
$_->dispose for values %{ $self->{cached_bits} };
$self->{cached_bits} = {};

However, apart from creating a dispose() method, we would also need to
uniquely identify the *Reader*, instead of the Filter, unless we think the
current method of using a stringified reference is sufficient.

So Marvin, at this point, I have two questions:

* Is the above description essentially correct? Am I missing anything?
* Which method do you prefer (including additional options)?

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

marvin at rectangular

Apr 3, 2007, 2:34 PM

Post #17 of 22 (5571 views)

On Apr 3, 2007, at 9:10 AM, Chris Nandor wrote:

> We need to cache the BitVector of a Filter on a per-reader basis.
> That
> BitVector should be destroyed when the Reader is.
>
> A Filter generally does not change, but it can, particularly in the
> case of
> a PolyFilter. When the Filter changes, the BitVector should no
> longer be
> used, and, ideally, it should be cleaned up, as it is not likely to
> be used
> again.
>
>
> That's basically it. Then there's implementation.

Well stated.

> [.NB: we still need a way to generate hash_code for PolyFilter.]

That could be as simple as adding the hashcodes of all component
filters, plus some extra factor for each logic.

> We could also use weak references. I generally think weak refs are
> a hack,
> but then again, the above is also turning into a hack. :)

Weak refs are a hack, but sometimes they enable elegant solutions.
Tell me what you think of the following, which doesn't even involve
IndexReader:

# Filter.pm

# Store a cached BitVector associated with a particular reader.
Store a weak
# reference to the Reader as an indicator of cache validity.
sub store_cached_bits {
my ( $self, $reader, $bits ) = @_;
my $pair = { reader => $reader, bits => $bits };
weaken( $pair->{reader} );
$self->{cached_bits}{$reader} = $pair;
}

# Retrived a cached BitVector associated with a particular reader.
As a side
# effect, clear away any BitVectors which are no longer valid because
their
# readers have gone away.
sub fetch_cached_bits {
my ( $self, $reader ) = @_;
my $cached_bits = $self->{cached_bits};

# sweep
while ( my ( $stringified, $pair ) = each %$cached_bits ) {
# if weak ref has decomposed into undef, reader is gone...
so delete
next if defined $pair->{reader};
delete $cached_bits->{$stringified};
}

# fetch
my $pair = $cached_bits->{$reader};
return $pair->{bits} if defined $pair;
return;
}

> However, apart from creating a dispose() method, we would also need to
> uniquely identify the *Reader*, instead of the Filter, unless we
> think the
> current method of using a stringified reference is sufficient.

Stringification is sufficient for the above, because the undef-ness
of the decomposed weak reference protects us against a second reader
impersonating one that's dead and gone.

FWIW, we might consider adding this to KinoSearch::Util::Class:

sub hash_code { refaddr(shift) }

Then we could use $reader->hash_code in place of "$reader", above.
But it doesn't really matter for this particular application.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 3, 2007, 2:43 PM

Post #18 of 22 (5578 views)

At 14:25 -0700 2007.04.03, Marvin Humphrey wrote:
>Stringification is sufficient for the above, because the undef-ness
>of the decomposed weak reference protects us against a second reader
>impersonating one that's dead and gone.

Yeah, and we won't need a dispose() this way, too.

I think we've found a winner.

>FWIW, we might consider adding this to KinoSearch::Util::Class:
>
> sub hash_code { refaddr(shift) }
>
>Then we could use $reader->hash_code in place of "$reader", above.
>But it doesn't really matter for this particular application.

Yeah ... and my only concern is that calling it hash_code might break the
notion that the hash_code changes if the underlying object changes. But
maybe that concern is not warranted, I don't know.

So I'll go ahead and do all of the above and look into some tests for it,
if you like.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

marvin at rectangular

Apr 3, 2007, 3:07 PM

Post #19 of 22 (5580 views)

On Apr 3, 2007, at 2:34 PM, Chris Nandor wrote:

> I think we've found a winner.

Excellent.

>> FWIW, we might consider adding this to KinoSearch::Util::Class:
>>
>> sub hash_code { refaddr(shift) }
>>
>> Then we could use $reader->hash_code in place of "$reader", above.
>> But it doesn't really matter for this particular application.
>
> Yeah ... and my only concern is that calling it hash_code might
> break the
> notion that the hash_code changes if the underlying object
> changes. But
> maybe that concern is not warranted, I don't know.

That's not part of the Java hashCode contract, which KS is trying to
emulate. It's OK for multiple objects to return the same value for
hash_code. Here's an excerpt from that IBM article...

One compatible, but not all that useful, way
to define hashCode() is like this:

public int hashCode() { return 0; }

This approach will yield horrible performance
for HashMaps with a large number of entries,
but it does conform to the specification.

> So I'll go ahead and do all of the above and look into some tests
> for it,
> if you like.

That would be much appreciated. :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 3, 2007, 3:48 PM

Post #20 of 22 (5581 views)

At 14:59 -0700 2007.04.03, Marvin Humphrey wrote:
>That's not part of the Java hashCode contract, which KS is trying to
>emulate. It's OK for multiple objects to return the same value for
>hash_code.

Right, but what I was thinking was the same value for the SAME object, when
that object has been modified. But I suppose if it is important for a
given object to return a different value, then it can implement its own
hash_code. So I'll add that too, and then hash_code can be added elsewhere
as needed ...

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/

PolyFilter and Plans [ In reply to ]

projects at pudge

Apr 4, 2007, 1:18 AM

Post #21 of 22 (5573 views)

Here's a patch. For query_filter.t, I check cached_bits directly for the
proper count of entries with a defined reader, to see if they are undef'd
when the reader goes out of scope.

I added hash_code and a simple test for it; not sure if it is best to check
it the way I did, but there's not much else to do.

A few little cleanups here and there.

--
Chris Nandor pudge@pobox.com http://pudge.net/
Open Source Technology Group pudge@ostg.com http://ostg.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-20070403.txt
Type: application/octet-stream
Size: 12422 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070404/6cfbdbb8/patch-20070403.obj

PolyFilter and Plans [ In reply to ]

marvin at rectangular

Apr 4, 2007, 5:36 PM

Post #22 of 22 (5574 views)

On Apr 4, 2007, at 1:09 AM, Chris Nandor wrote:

> Here's a patch.

Thanks! I've applied this (with trivial tweaks) as revision 2260.

> For query_filter.t, I check cached_bits directly for the
> proper count of entries with a defined reader, to see if they are
> undef'd
> when the reader goes out of scope.

Perfect.

> I added hash_code and a simple test for it; not sure if it is best
> to check
> it the way I did, but there's not much else to do.

Yeah. Looks fine.

> A few little cleanups here and there.

I added a couple of my own, like fixing my own typos in the comments
to fetch_cached_bits.

I see you added this to Class.pm:

use Scalar::Util 'refaddr';

That's not actually needed, because it's brought in by
KinoSearch::Util::ToolSet:

http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Util/
ToolSet.html

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/