Mailing List Archive

Playing with MultiSearcher framework
Hello,

I've started playing around with MultiSearcher (MS) and have hit a snag.
If you could shed some light on what I'm doing wrong, it would be
appreciated.

Given:
- kinosearch dev revision 2605
- Each server has a basic ::SearchServer process running.
- Passwords, ports, etc. all match.
- Searching on a single machine without MS works as expected.
- Searching using all machines with MS and with highlighting+sort results in
  errors (see below for more detail).
- Searching using all machines with MS but without highlighting+sort works
  fine, sans highlighting+sort effects of course.

MultiSearcher code:

...
use Test::Schema;
use KinoSearch::Searcher;
use KinoSearch::Highlight::Highlighter;
use KinoSearch::Search::SortSpec;
use KinoSearch::Search::MultiSearcher; # just include 'em all.
use KinoSearch::Search::SearchServer;
use KinoSearch::Search::SearchClient;
...
my $sort_spec = KinoSearch::Search::SortSpec->new;
$sort_spec->add(
    field   => 'sort_field',
    reverse => 1,
);
...
my $port = 7890;
my $pass = 'searchpw';
my @searchers;
my @server_names = ( '10.1.1.10', '10.1.1.11', '10.1.1.12', ... );
my $schema = Test::Schema->new;

for my $server_name (@server_names) {
    push @searchers,
        KinoSearch::Search::SearchClient->new(
            peer_address => "$server_name:$port",
            password     => $pass,
            schema       => $schema,
        );
}

my $multi_searcher = KinoSearch::Search::MultiSearcher->new(
    searchables => \@searchers,
    schema      => $schema,
);

my $hits = $multi_searcher->search(
    query      => $q,
    offset     => $offset,
    num_wanted => $hits_per_page,
    sort_spec  => $sort_spec,
);
KinoSearch::Search::MultiSearcher->set_enable_sorting(1); # (A)

#my $highlighter = KinoSearch::Highlight::Highlighter->new;
#$highlighter->add_spec( field => 'body' );
#$highlighter->add_spec( field => 'title' );
#$hits->create_excerpts( );
#$hits->create_excerpts( highlighter => $highlighter );
...

Searching using the above code results in a "sort_spec not currently
supported by MultiSearcher..." error no matter where I place (A).

Commenting (ignorant quick-hack, I know) out:
#confess("sort_spec not currently supported by MultiSearcher")
# if ( $sort_spec and !$enable_sorting );
in /usr/lib/perl5/.../KinoSearch/Search/MultiSearcher.pm
results in an error cascade starting with "Use of uninitialized value in
read at...SearchClient.pm line 66, <GEN2> line 1.".
This also kills the ::SearchServer process with the error "Can't locate
object method "make_field_doc_collator" via package
"KinoSearch::Search::SortSpec"...Searcher.pm line 62, <GEN2> line 3.".

Commenting out all the sort code, but enabling the highlighter code,
results in the error: "Can't call method "set_terms" on an undefined value
at...Search/Hits.pm line 73, <GEN2> line 1."

Commenting out all the sort and highlighter code results in successful
searching across all search nodes (but without sorting and highlighting of
course).

Sorry for the long, rambling post. Any pointers would be appreciated.

Regards
Henry


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Playing with MultiSearcher framework
On Nov 2, 2007, at 5:45 AM, Henry wrote:

> my $hits = $multi_searcher->search(
>     query      => $q,
>     offset     => $offset,
>     num_wanted => $hits_per_page,
>     sort_spec  => $sort_spec,
> );
> KinoSearch::Search::MultiSearcher->set_enable_sorting(1); # (A)

> Searching using the above code results in a "sort_spec not currently
> supported by MultiSearcher..." error no matter where I place (A).

I don't get that. Does it happen even if you place it right after
the 'use' directive? That error occurs in
$multi_searcher->top_docs(), which is called internally by
$multi_searcher->search(). Calling set_enable_sorting() any time prior
to the $multi_searcher->search() command ought to work.
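
To illustrate the ordering point, here is a minimal sketch built on a made-up class. Toy::MultiSearcher is hypothetical and is not the KinoSearch source; the only assumption carried over from the thread is that the flag is a class-wide setting that gets checked when the search runs.

```perl
use strict;
use warnings;

# Hypothetical stand-in for illustration only; not the KinoSearch source.
package Toy::MultiSearcher;
use Carp qw( confess );

my $enable_sorting = 0;    # class-wide flag, off by default

sub set_enable_sorting {
    my ( $class, $value ) = @_;
    $enable_sorting = $value ? 1 : 0;
    return;
}

sub new { my $class = shift; return bless {}, $class }

# The flag is consulted at search time, so it must be set before this call.
sub search {
    my ( $self, %args ) = @_;
    confess("sort_spec not currently supported by MultiSearcher")
        if $args{sort_spec} and !$enable_sorting;
    return "searched";
}

package main;

my $searcher = Toy::MultiSearcher->new;

# Flag still off: a sorted search dies, and eval leaves $early undefined.
my $early = eval { $searcher->search( sort_spec => {} ) };
print "before enabling: ", ( defined $early ? $early : "died" ), "\n";

# Flag switched on first: the same call goes through.
Toy::MultiSearcher->set_enable_sorting(1);
my $late = $searcher->search( sort_spec => {} );
print "after enabling: $late\n";
```

In this toy version, calling set_enable_sorting(1) after search() has no effect on the search that already failed, which is why the position of (A) relative to search() matters.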

For those not yet in the know (which is everyone except Henry)... I
hacked in undocumented support for remote sorting, but it's crippled
by default because the implementation is less than ideal.

> Commenting (ignorant quick-hack, I know) out:
> #confess("sort_spec not currently supported by MultiSearcher")
> # if ( $sort_spec and !$enable_sorting );
> in /usr/lib/perl5/.../KinoSearch/Search/MultiSearcher.pm
> results in an error cascade starting with "Use of uninitialized
> value in read at...SearchClient.pm line 66, <GEN2> line 1.".

Line 66 is where the SearchClient tries to read from the socket.
It's failing because the remote node isn't responding.

> This also kills the ::SearchServer process with the error "Can't
> locate object method "make_field_doc_collator" via package
> "KinoSearch::Search::SortSpec"...Searcher.pm line 62, <GEN2> line 3.".

It looks like KinoSearch::Searcher does not contain a "use
KinoSearch::Search::SortSpec" directive, so you'll have to add that
to the scripts running on the slave nodes. I should probably add
that to Searcher. Hey, what's one more module to load? :\

> Commenting out all the sort code, but enabling the highlighter code,
> results in the error: "Can't call method "set_terms" on an undefined
> value at...Search/Hits.pm line 73, <GEN2> line 1."

I think that's happening because of this:

#$hits->create_excerpts( );
#$hits->create_excerpts( highlighter => $highlighter );

'highlighter' is a required argument for $hits->create_excerpts, so
that first line would fail. I should probably add validation code to
create_excerpts() so that a more meaningful error message gets produced.
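
As a rough idea of what such validation could look like, here is a self-contained sketch. The helper verify_required_args and the function create_excerpts_sketch are hypothetical names invented for this example; this is not the actual KinoSearch code.

```perl
use strict;
use warnings;
use Carp qw( confess );

# Hypothetical helper, for illustration only; not the KinoSearch implementation.
# Dies with a descriptive message if any required named parameter is missing.
sub verify_required_args {
    my ( $args, @required ) = @_;
    for my $name (@required) {
        confess("Missing required parameter: '$name'")
            unless defined $args->{$name};
    }
    return 1;
}

# Toy stand-in for a method that needs a 'highlighter' argument.
sub create_excerpts_sketch {
    my %args = @_;
    verify_required_args( \%args, 'highlighter' );
    return "excerpts created with $args{highlighter}";
}

# Supplying the argument works.
my $ok = create_excerpts_sketch( highlighter => 'toy-highlighter' );
print "$ok\n";

# Omitting it fails early, naming the missing parameter.
my $failed_cleanly = !eval { create_excerpts_sketch(); 1 };
my $error          = $@;
print "error: ", ( split /\n/, $error )[0], "\n" if $failed_cleanly;
```

Failing at the point of the call, with the parameter named, is easier to act on than a later "Can't call method ... on an undefined value".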

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Playing with MultiSearcher framework
>> KinoSearch::Search::MultiSearcher->set_enable_sorting(1); # (A)
>
>> Searching using the above code results in a "sort_spec not currently
>> supported by MultiSearcher..." error no matter where I place (A).
>
> I don't get that. Does it happen even if you place it right after
> the 'use' directive? That error occurs in
> $multi_searcher->top_docs(), which is called internally by
> $multi_searcher->search(). Calling set_enable_sorting() any time prior
> to the $multi_searcher->search() command ought to work.

This one is probably my own mistake. I will try to reproduce that error later.

> It looks like KinoSearch::Searcher does not contain a "use
> KinoSearch::Search::SortSpec" directive, so you'll have to add that
> to the scripts running on the slave nodes. I should probably add
> that to Searcher. Hey, what's one more module to load? :\

OK - added the 'use' line to all nodes. That's resolved that one.

>> Commenting out all the sort code, but enabling the highlighter code,
>> results in the error: "Can't call method "set_terms" on an undefined
>> value at...Search/Hits.pm line 73, <GEN2> line 1."
>
> I think that's happening because of this:
>
> #$hits->create_excerpts( );
> #$hits->create_excerpts( highlighter => $highlighter );
>
> 'highlighter' is a required argument for $hits->create_excerpts, so
> that first line would fail. I should probably add validation code to
> create_excerpts() so that a more meaningful error message gets produced.

Right you are; sorry for missing that.

OK, using the following to create excerpts results in the error below:

my $highlighter = KinoSearch::Highlight::Highlighter->new;
$highlighter->add_spec( field => 'body' );
$highlighter->add_spec( field => 'title' );
$hits->create_excerpts( highlighter => $highlighter );

Can't call method "term_vector" on unblessed reference at
/usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/Highlight/Highlighter.pm
line 226, <GEN2> line 1.

Disabling the above code yields successful search results (without
highlighting of course) with sorting.

Performance (0.5-0.7s) is not bad at all Marvin (admittedly on a small
subset of the full index), excellent work! If this distributed search
implementation is less than ideal, then I would imagine there are great
things to come.

I'll be gradually increasing the size of the sub-indexes on all the nodes
and will provide feedback on performance.

Regards
Henry


Re: Playing with MultiSearcher framework
On Nov 5, 2007, at 12:15 AM, Henry wrote:

>> It looks like KinoSearch::Searcher does not contain a "use
>> KinoSearch::Search::SortSpec" directive, so you'll have to add that
>> to the scripts running on the slave nodes. I should probably add
>> that to Searcher. Hey, what's one more module to load? :\
>
> OK - added the 'use' line to all nodes. That's resolved that one.

Groovy. I've added a 'use SortSpec' directive to Searchable as of
r2608.

>> 'highlighter' is a required argument for $hits->create_excerpts, so
>> that first line would fail. I should probably add validation code to
>> create_excerpts() so that a more meaningful error message gets
>> produced.
>
> Right you are; sorry for missing that.

OK, glad that problem's solved. I've strengthened the param checking
with r2609.

> OK, using the following to create excerpts results in the error below:
>
> my $highlighter = KinoSearch::Highlight::Highlighter->new;
> $highlighter->add_spec( field => 'body' );
> $highlighter->add_spec( field => 'title' );
> $hits->create_excerpts( highlighter => $highlighter );
>
> Can't call method "term_vector" on unblessed reference at
> /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/Highlight/Highlighter.pm
> line 226, <GEN2> line 1.

Hmm, curious. It's not immediately apparent why that's happening.

However, I have a kill-many-birds-with-one-stone solution up my
sleeve. We're currently fetching the document correctly. So let's
add the term vector data to the document itself. Put it in an
auxiliary, binary field: e.g. content_HIGHLIGHTDATA.

The primary downside is that such a change is not backwards
compatible, but we just made one backwards-incompatible change (doc
nums starting at 1). So it's time to jam in a bunch, while writing
the file format spec.

A side effect is that highlighting won't be enabled by default any
more. That's a little less convenient, but it also means indexes
will default to being smaller.

The *major* upside is that term vectors won't need to be part of the
InvIndex file format spec. :) That section was going to be a PITA,
and by ditching it, we keep things simple and finish the spec sooner.

> Performance (0.5-0.7s) is not bad at all Marvin (admittedly on a small
> subset of the full index), excellent work!

Is there a performance difference between plain search and sorted
search? And are the invindexes optimized?

The primary theoretical flaw in the current sorted remote search
implementation is that there may be a lot of disk thrash for an
unoptimized index as term numbers are converted into terms.

> If this distributed search implementation is less than ideal, then I
> would imagine there are great things to come.

Here's what I have in mind:

SegWriter becomes a public module, and takes on an API similar to
that of PolyAnalyzer -- i.e. it becomes an array of writers. This
will allow us to subclass DocWriter with e.g. DistributedDocWriter.
(PrimaryKeyOnlyDocWriter would be another useful possibility, if
you're combining KS with an RDBMS). That would allow us to have
dedicated machines performing the role of fetching/highlighting.

Lexicons would be handled in similar fashion, as would posting
lists. The idea is to modularize things by task and write
specialized modules for a distributed setup. This is how e.g. Google
does things, and I believe it's a better model than the current
MultiSearcher.
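
A toy sketch of the "array of writers" idea follows. Every class name in it is hypothetical and invented for illustration; none of this is existing KinoSearch code, and the real design may well differ.

```perl
use strict;
use warnings;

# All class names below are hypothetical; this is not KinoSearch code.

# A writer component only needs an add_doc() method.
package Toy::DocWriter;
sub new { my $class = shift; return bless { stored => [] }, $class }

sub add_doc {
    my ( $self, $doc ) = @_;
    push @{ $self->{stored} }, {%$doc};    # keep a copy of every field
    return;
}

# A subclass that keeps only the primary key, e.g. when the full record
# lives in an external database.
package Toy::PrimaryKeyOnlyDocWriter;
our @ISA = ('Toy::DocWriter');

sub add_doc {
    my ( $self, $doc ) = @_;
    push @{ $self->{stored} }, { id => $doc->{id} };
    return;
}

# The segment writer is just an ordered list of components.
package Toy::SegWriter;

sub new {
    my ( $class, %args ) = @_;
    return bless { writers => $args{writers} || [] }, $class;
}

# Hand each document to every component in turn.
sub add_doc {
    my ( $self, $doc ) = @_;
    $_->add_doc($doc) for @{ $self->{writers} };
    return;
}

package main;

my $full       = Toy::DocWriter->new;
my $pk_only    = Toy::PrimaryKeyOnlyDocWriter->new;
my $seg_writer = Toy::SegWriter->new( writers => [ $full, $pk_only ] );
$seg_writer->add_doc( { id => 42, title => 'hello' } );

my $full_fields = join ',', sort keys %{ $full->{stored}[0] };
my $pk_fields   = join ',', sort keys %{ $pk_only->{stored}[0] };
print "full writer stored: $full_fields\n";
print "pk-only writer stored: $pk_fields\n";
```

Because the segment writer only loops over its components, swapping one writer for a specialized subclass does not require touching the others.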

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Playing with MultiSearcher framework
>> Can't call method "term_vector" on unblessed reference at
>> /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/KinoSearch/Highlight/Highlighter.pm
>> line 226, <GEN2> line 1.
>
> Hmm, curious. It's not immediately apparent why that's happening.
>
> However, I have a kill-many-birds-with-one-stone solution up my
> sleeve. We're currently fetching the document correctly. So let's
> add the term vector data to the document itself. Put it in an
> auxiliary, binary field: e.g. content_HIGHLIGHTDATA.
>
> The primary downside is that such a change is not backwards
> compatible, but we just made one backwards-incompatible change (doc
> nums starting at 1). So it's time to jam in a bunch, while writing
> the file format spec.

Sounds good. I've paused global indexing for the time being anyway - busy
with data consolidation/rank analysis, etc.

> A side effect is that highlighting won't be enabled by default any
> more. That's a little less convenient, but it also means indexes
> will default to being smaller.

Smaller == faster (from an IO perspective anyway). So this is good news
indeed. I've already cut the size of my indexes dramatically by
limiting the document sizes (to a reasonable 100k).

> The *major* upside is that term vectors won't need to be part of the
> InvIndex file format spec. :) That section was going to be a PITA,
> and by ditching it, we keep things simple and finish the spec sooner.

Excellent. KISS is good.

>> Performance (0.5-0.7s) is not bad at all Marvin (admittedly on a small
>> subset of the full index), excellent work!
>
> Is there a performance difference between plain search and sorted
> search? And are the invindexes optimized?

[quick repeated tests without caching, nodes have no other activity]

With sort: ~0.515s
Without: ~0.450s

All indexes optimized.

> The primary theoretical flaw in the current sorted remote search
> implementation is that there may be a lot of disk thrash for an
> unoptimized index as term numbers are converted into terms.
>
>> If this distributed search implementation is less than ideal, then I
>> would imagine there are great things to come.
>
> Here's what I have in mind:
>
> SegWriter becomes a public module, and takes on an API similar to
> that of PolyAnalyzer -- i.e. it becomes an array of writers. This
> will allow us to subclass DocWriter with e.g. DistributedDocWriter.
> (PrimaryKeyOnlyDocWriter would be another useful possibility, if
> you're combining KS with an RDBMS). That would allow us to have
> dedicated machines performing the role of fetching/highlighting.

Great idea - distribute not only searching, but other processing as well.

> Lexicons would be handled in similar fashion, as would posting
> lists. The idea is to modularize things by task and write
> specialized modules for a distributed setup. This is how e.g. Google
> does things, and I believe it's a better model than the current
> MultiSearcher.

A nice modular distributed approach allowing more flexibility in terms of
overall (end-user) design and performance. Great for scaling up (and
sideways)...

I'm curious: this sounds like quite a bit of work. What's your thinking in
terms of schedule/timeline?

Regards
Henry

