Mailing List Archive

Compacting the core
Greets,

(I'm cc'ing this to lucy-dev@lucene.apache.org, because I think Lucy
should follow the same design principles described in this post.)

KinoSearch is spinning off a few modules, to cut down on the core size
and complexity. For the present time, they will continue to be
distributed with the KinoSearch tarball, but eventually they will
become separate distributions.

KinoSearch::Search::SearchServer and KinoSearch::Search::SearchClient
have moved to KSx::Remote::SearchServer and
KSx::Remote::SearchClient. Eventually, they will be distributed under
KSx::Remote.

The rationale for breaking out SearchServer/SearchClient is that there
are many ways to have machines interconnect: the Socket/faked-up-RPC
approach taken by SearchClient/SearchServer, the XML approach used by
Solr, etc. For core, it is only crucial that the messages that have
to be sent over the network be serializable using *some* technique --
it's not important what technique is chosen.
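
As a rough illustration of "some technique", the sketch below round-trips
one query description through two unrelated serializers. It is plain Perl
using only core modules, not KinoSearch code: the hash layout of $query is
a hypothetical stand-in for whatever a real Query object would carry, and
neither format is claimed to be what SearchClient/SearchServer put on the
wire.

```perl
use strict;
use warnings;
use Storable qw( nfreeze thaw );
use JSON::PP ();

# Hypothetical plain-data description of a query; not a KinoSearch object.
my $query = {
    type     => 'AND',
    children => [
        { type => 'Term', field => 'content',  term => 'cake' },
        { type => 'Term', field => 'category', term => 'sweet' },
    ],
};

# Technique 1: Storable's network-order binary format.
my $via_storable = thaw( nfreeze($query) );

# Technique 2: JSON text (canonical mode sorts hash keys, so the
# encoded strings can be compared directly).
my $json     = JSON::PP->new->canonical;
my $via_json = $json->decode( $json->encode($query) );

# Both round trips should reproduce the same structure.
my $same = $json->encode($via_storable) eq $json->encode($via_json);
print $same ? "round trips agree\n" : "round trips differ\n";
```

Either serializer would do here, since the receiving side only needs to
rebuild an equivalent query.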

The other spinoff is Filter. KinoSearch::Search::Filter,
KinoSearch::Search::QueryFilter, and KinoSearch::Search::PolyFilter
have all been removed; their functionality is now encapsulated in
KSx::Search::Filter, which has been refactored as a subclass of
Query. The last filter subclass, KinoSearch::Search::RangeFilter, has
been replaced by a new core class, KinoSearch::Search::RangeQuery
(which behaves similarly to Lucene's ConstantScoringRangeQuery with a
fixed score of 0).

The standard KS search methods no longer take a 'filter' argument.
Here's the new Filter API in action:

    my %category_filters;
    for my $category (qw( sweet sour salty bitter )) {
        my $cat_query = KinoSearch::Search::TermQuery->new(
            field => 'category',
            term  => $category,
        );
        $category_filters{$category} = KSx::Search::Filter->new(
            query => $cat_query,
        );
    }

    while ( my $cgi = CGI::Fast->new ) {
        my $user_query = $cgi->param('q');
        my $filter     = $category_filters{$cgi->param('category')};
        my $and_query  = KinoSearch::Search::ANDQuery->new;
        $and_query->add_child($user_query);
        $and_query->add_child($filter);
        my $hits = $searcher->search( query => $and_query );
        ...

Filter is moving outside of core because it is essentially nothing
more than a caching optimization. Logically, the following code would
produce exactly the same results as the code above:

    while ( my $cgi = CGI::Fast->new ) {
        my $user_query     = $cgi->param('q');
        my $category_query = KinoSearch::Search::TermQuery->new(
            field => 'category',
            # scalar() keeps param() from returning a list in this
            # argument list and shifting the named parameters.
            term  => scalar $cgi->param('category'),
        );
        $category_query->set_boost(0);
        my $and_query = KinoSearch::Search::ANDQuery->new;
        $and_query->add_child($user_query);
        $and_query->add_child($category_query);
        my $hits = $searcher->search( query => $and_query );
        ...

The only significant differences are that the Filter only runs the
query once, and that it can't be serialized and sent over the network
in a search cluster (because the search results are cached in a
BitVector which is too big to send).
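
To make that trade-off concrete, here is a minimal sketch of the caching
idea in plain Perl. It does not use the KinoSearch API: the toy @docs
corpus and the subs run_query, cached_filter_bits, and filtered_matches
are all hypothetical names, and a Perl string driven by vec() stands in
for the BitVector.

```perl
use strict;
use warnings;

# Toy corpus: the array index stands in for the document number.
my @docs = (
    { category => 'sweet', body => 'honey cake' },
    { category => 'sour',  body => 'lemon cake' },
    { category => 'sweet', body => 'maple syrup' },
    { category => 'salty', body => 'pretzel' },
);

my $query_runs = 0;    # counts how often the "expensive" query executes

# Hypothetical stand-in for running a term query: returns a bit vector
# with one bit per document, set when the document matches.
sub run_query {
    my ( $field, $term ) = @_;
    $query_runs++;
    my $bits = '';
    for my $doc_num ( 0 .. $#docs ) {
        vec( $bits, $doc_num, 1 ) = $docs[$doc_num]{$field} eq $term ? 1 : 0;
    }
    return $bits;
}

# The "filter": run the query on first use, then reuse the cached bits.
my %bits_cache;

sub cached_filter_bits {
    my ( $field, $term ) = @_;
    my $key = "$field\0$term";
    $bits_cache{$key} = run_query( $field, $term )
        unless exists $bits_cache{$key};
    return $bits_cache{$key};
}

# Intersect a user query's matches with the cached filter bits.
sub filtered_matches {
    my ( $user_term, $category ) = @_;
    my $filter = cached_filter_bits( category => $category );
    return [
        grep { vec( $filter, $_, 1 ) && $docs[$_]{body} =~ /\Q$user_term\E/ }
            0 .. $#docs
    ];
}

my $first  = filtered_matches( 'cake',  'sweet' );
my $second = filtered_matches( 'syrup', 'sweet' );
print "matches: @$first | @$second; category query ran $query_runs time(s)\n";
```

The cached vector costs one bit per document in the index, so it grows
with the index rather than with the query, which is why it is awkward to
ship between nodes while the query itself stays small.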

Lucene provides classes called RemoteCachingWrapperFilter and
FilterManager that address the problem of filter caching in search
clusters, and whose functionality might eventually end up in either
KSx::Remote or KSx::Search::Filter. Again, though, they are caching
optimizations with serialization limitations and as such belong
outside of core.

I thought about keeping Filter as an abstract base class, and putting
the actual functionality into KSx::Search::QueryFilter or something
like that. However, after reviewing the various Filter subclasses in
both Lucene's core and contrib, it looked to me as though nearly all
of them (all except for the SpanFilter subclasses which would need to
be different anyway) could be realized using either ordinary Queries
or Queries in conjunction with this new implementation of Filter.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Compacting the core
On Thu, Jun 12, 2008 at 8:08 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> KinoSearch is spinning off a few modules, to cut down on the core size and
> complexity.

I was cheering for both of these changes as they came across on the
commits list. I think this is an excellent direction. Getting the
core down to as small as possible sounds like a great plan. Good
work!

Nathan Kurz
nate@verse.com

Re: Compacting the core
I saw that you also removed AndNotQuery, which strikes me as a
good simplification as well.

I wonder if it would be a good test case for a KSx package. The code
is already done, and it might be a good check of how easy it is to add
similar functionality outside the core. My impressions from before
were that writing custom scorers was fairly straightforward, but that
getting them to be used felt clumsy. Probably this was just me
missing something elegant, and an optimization like this might serve
as a good example for some future me.

Nathan Kurz
nate@verse.com
