Mailing List Archive: r3868 - in trunk/perl/lib/KinoSearch/Docs: . Cookbook

Author: creamyg
Date: 2008-09-10 20:57:22 -0700 (Wed, 10 Sep 2008)
New Revision: 3868

Modified:
trunk/perl/lib/KinoSearch/Docs/Cookbook.pod
trunk/perl/lib/KinoSearch/Docs/Cookbook/CachedSearcher.pod
trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQuery.pod
trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQueryParser.pod
Log:
Write another draft of the Cookbook.

Modified: trunk/perl/lib/KinoSearch/Docs/Cookbook/CachedSearcher.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Cookbook/CachedSearcher.pod 2008-09-11 01:43:18 UTC (rev 3867)
+++ trunk/perl/lib/KinoSearch/Docs/Cookbook/CachedSearcher.pod 2008-09-11 03:57:22 UTC (rev 3868)
@@ -5,22 +5,22 @@

=head1 ABSTRACT

-At the core of every Searcher object is an IndexReader, and when an
-IndexReader object is created, a small portion of the InvIndex is loaded into
-memory. Additional caches are filled as relevant queries arrive.
+When a L<Searcher|KinoSearch::Searcher> object is created, a small portion of
+the invindex is loaded into memory; additional caches are filled as relevant
+queries arrive. For small document collections on lightly-loaded servers, the
+time it takes to warm up the Searcher isn't worth worrying about. For large
+document collections or busy servers, though, the warmup time may become
+significant, in which case reusing the Searcher is likely to speed up your
+application.

-For small document collections on lightly-loaded servers, the time to warm up
-the Searcher/Reader isn't worth worrying about. For large document
-collections or busy servers, the warmup time may become significant, in which
-case reusing the Searcher is likely to speed up your application.
-
=head1 FastCGI

-A script running under standard CGI runs once per request. In contrast, a
-script running on FastCGI webserver using the CGI::Fast module from CPAN
-starts upon the first request then executes a loop once per request.
+A script running under standard CGI runs once per request; in contrast, a
+script running on a FastCGI-enabled webserver using the CGI::Fast module from
+CPAN starts up on the first request then executes a loop once per request.

-Create your Searcher outside this loop:
+Create your Searcher outside this loop, so that the object persists over
+multiple requests:

my $searcher = KinoSearch::Searcher->new(
invindex => MySchema->read('/path/to/invindex/')
@@ -77,8 +77,9 @@
fetch hits 0.006 0.008 75.602%
_stop_ 0.000 0.008 0.186%

-As the numbers indicate, for a simple term query, the time to initialize the
-Searcher overwhelms the time to execute the search and return results.
+It's clear from those numbers that for a simple term query, the time it takes
+to initialize the Searcher swamps the time it takes to execute the search and
+return results.

=head1 COPYRIGHT

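Reader's note on the CachedSearcher recipe: the pattern above amounts to
constructing the expensive object once, outside the per-request loop. Below is
a minimal sketch of that idea in plain Perl. StubSearcher is a hypothetical
stand-in (not a KinoSearch class), and the for loop is an assumption standing
in for the FastCGI request loop.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an object that is expensive to construct.
package StubSearcher;

my $constructed = 0;    # counts how many times new() has run

sub new {
    my ( $class, %args ) = @_;
    $constructed++;
    return bless {%args}, $class;
}

sub search {
    my ( $self, %args ) = @_;
    return "results for '$args{query}'";
}

sub times_constructed { return $constructed }

package main;

# Build once, outside the loop, so the warmed-up object persists.
my $searcher = StubSearcher->new( invindex => '/path/to/invindex/' );

# Stands in for the per-request loop of a persistent process.
for my $q (qw( congress senate president )) {
    print $searcher->search( query => $q ), "\n";
}

print "constructed ", StubSearcher->times_constructed, " time(s)\n";
```

Moving the constructor call inside the loop would run it once per request,
which is the cost the recipe sets out to avoid.
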
Modified: trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQuery.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQuery.pod 2008-09-11 01:43:18 UTC (rev 3867)
+++ trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQuery.pod 2008-09-11 03:57:22 UTC (rev 3868)
@@ -39,10 +39,10 @@

=back

-PrefixQuery on its own isn't enough because Query objects are mainly
-containers for metadata describing what to search for, and as such they can't
-do much -- they merely express a spec and leave the implementation of that
-spec to their companion classes.
+The PrefixQuery class on its own isn't enough because a Query object's role is
+limited to expressing an abstract specification for the search. A Query is
+basically nothing but metadata; execution is left to the Query's companion
+Compiler and Scorer.

Here's a simplified sketch illustrating how a Searcher's search() method ties
together the three classes.
@@ -57,13 +57,13 @@

=head2 PrefixQuery

-The PrefixQuery class has two basic attributes: a query string and a field
+Our PrefixQuery class will have two attributes: a query string and a field
name.

package PrefixQuery;
use base qw( KinoSearch::Search::Query );
use Carp;
-
+
# Inside-out member vars and hand-rolled accessors.
my %query_string;
my %field;
@@ -77,19 +77,14 @@
my $query_string = delete $args{query_string};
my $field = delete $args{field};
my $self = $class->SUPER::new(%args);
-
- # Validate and assign required parameters.
confess("'query_string' param is required")
unless defined $query_string;
+ confess("Invalid query_string: '$query_string'")
+ unless $query_string =~ /\*\s*$/;
confess("'field' param is required")
unless defined $field;
$query_string{$$self} = $query_string;
$field{$$self} = $field;
-
- # Only support trailing wildcards, i.e. "hous*" but not "hou*s".
- confess("Invalid query_string: '$query_string'")
- unless $query_string =~ /\*\s*$/;
-
return $self;
}

@@ -123,7 +118,7 @@

Searchable objects have access to certain statistical information about the
collections they represent; for instance, a Searchable can tell you how many
-documents there are...
+documents are in the collection...

my $maximum_number_of_docs_in_collection = $searchable->max_docs;

@@ -148,7 +143,7 @@

sub make_scorer {
my ( $self, $index_reader ) = @_;
-
+
# Acquire a Lexicon and seek it to our query string.
my $substring = $self->get_parent->get_query_string;
$substring =~ s/\*\s*$//;
@@ -156,7 +151,7 @@
my $lexicon = $index_reader->lexicon( field => $field );
return unless $lexicon;
$lexicon->seek($substring);
-
+
# Accumulate PostingLists for each matching term.
my @posting_lists;
while ( defined( my $term = $lexicon->get_term ) ) {
@@ -171,17 +166,19 @@
last unless $lexicon->next;
}
return unless @posting_lists;
-
+
return PrefixScorer->new( posting_lists => \@posting_lists );
}

PrefixCompiler gets access to an
L<IndexReader|KinoSearch::Search::IndexReader> object when make_scorer() gets
-called. From the IndexReader we acquire a Lexicon, which is a list of a
-field's unique terms; we iterate over the terms in the Lexicon, acquiring a
-PostingList for each term that matches our prefix.
+called. From the IndexReader we acquire a
+L<Lexicon|KinoSearch::Index::Lexicon>, which is an iterator for a field's
+unique terms; we scan through the Lexicon's terms, acquiring a
+L<PostingList|KinoSearch::Index::PostingList> for each term that matches our
+prefix.

-Each of these PostingList objects represents a list of documents which match
+Each of these PostingList objects represents a set of documents which match
the query.

=head2 PrefixScorer
@@ -190,17 +187,17 @@

package PrefixScorer;
use base qw( KinoSearch::Search::Scorer );
-
+
# Inside-out member vars.
my %doc_nums;
my %tally;
my %tick;
-
+
sub new {
my ( $class, %args ) = @_;
my $posting_lists = delete $args{posting_lists};
my $self = $class->SUPER::new(%args);
-
+
# Cheesy but simple way of interleaving PostingList doc sets.
my %all_doc_nums;
for my $posting_list (@$posting_lists) {
@@ -210,11 +207,11 @@
}
my @doc_nums = sort { $a <=> $b } keys %all_doc_nums;
$doc_nums{$$self} = \@doc_nums;
-
+
$tick{$$self} = -1;
$tally{$$self} = KinoSearch::Search::Tally->new;
$tally{$$self}->set_score(1.0); # fixed score of 1.0
-
+
return $self;
}

@@ -259,13 +256,10 @@
return $tally{$$self};
}

-=head1 CONCLUSION
+=head1 Usage

-To see PrefixQuery action, try feeding it the query string in the sample US
-constitution search.cgi app.
-
-If you're feeling ambitious, you can also try extending
-KinoSearch::QueryParser to support PrefixQuery, as described in
+To try out PrefixQuery, insert the FlatQueryParser module (which supports
+PrefixQuery) into the search.cgi sample app, as described in
L<KinoSearch::Docs::Cookbook::CustomQueryParser>.

=head1 COPYRIGHT

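Reader's note on the prefix-matching logic: the validation in
PrefixQuery->new and the Lexicon scan in make_scorer() can be tried out
without an index. The sketch below is an assumption-laden illustration, not
the KinoSearch implementation. prefix_from_query_string() and matching_terms()
are hypothetical helper names, and a sorted Perl array stands in for the
Lexicon.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only trailing wildcards ("hous*", not "hou*s")
# and return the bare prefix.
sub prefix_from_query_string {
    my ($query_string) = @_;
    die "'query_string' param is required" unless defined $query_string;
    die "Invalid query_string: '$query_string'"
        unless $query_string =~ /\*\s*$/;
    ( my $prefix = $query_string ) =~ s/\*\s*$//;
    return $prefix;
}

# Stand-in for seeking and iterating a Lexicon. Because the terms are
# sorted, every term sharing the prefix sits in one contiguous run.
sub matching_terms {
    my ( $prefix, @sorted_terms ) = @_;
    my @matches;
    for my $term (@sorted_terms) {
        next if $term lt $prefix;                    # before the seek point
        last unless index( $term, $prefix ) == 0;    # past the matching run
        push @matches, $term;
    }
    return @matches;
}

my @lexicon = sort qw( horse hot house housed houses hover );
my @hits = matching_terms( prefix_from_query_string('hous*'), @lexicon );
print "@hits\n";
```

In the recipe itself, each matching term would contribute a PostingList rather
than a string; the early exit from the loop is what keeps the scan cheap.
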
Modified: trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQueryParser.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQueryParser.pod 2008-09-11 01:43:18 UTC (rev 3867)
+++ trunk/perl/lib/KinoSearch/Docs/Cookbook/CustomQueryParser.pod 2008-09-11 03:57:22 UTC (rev 3868)
@@ -5,7 +5,7 @@

=head1 ABSTRACT

-Create a custom search query language, using KinoSearch::QueryParser and
+Implement a custom search query language using KinoSearch::QueryParser and
Parse::RecDescent.

=head1 Grammar-based vs. hand-rolled
@@ -48,7 +48,7 @@
We'll use a fixed field name of "content", and a fixed choice of English
PolyAnalyzer.

- package SimpleQueryParser;
+ package FlatQueryParser;
use KinoSearch::Search::TermQuery;
use KinoSearch::Search::PhraseQuery;
use KinoSearch::Search::ORQuery;
@@ -83,8 +83,8 @@
);
}

-This private _tokenize() method treats double-quote delimited material as a
-phrase and everything else as a term:
+Our private _tokenize() method treats double-quote delimited material as a
+single token and splits on whitespace everywhere else.

sub _tokenize {
my ( $self, $query_string ) = @_;
@@ -106,7 +106,9 @@

The main parsing routine creates an array of tokens by calling _tokenize(),
runs the tokens through the PolyAnalyzer, creates TermQuery or
-PhraseQuery objects, and adds each of the sub-queries to the primary ORQuery.
+PhraseQuery objects according to how many tokens emerge from the
+PolyAnalyzer's split() method, and adds each of the sub-queries to the primary
+ORQuery.

sub parse {
my ( $self, $query_string ) = @_;
@@ -204,11 +206,9 @@
this time -- KinoSearch::QueryParser's constructor requires a Schema which
conveys field and Analyzer information, so we can just defer to that.

- package SimpleQueryParser;
+ package FlatQueryParser;
use base qw( KinoSearch::QueryParser );

- ...
-
our %rd_parser;

sub new {
@@ -272,7 +272,7 @@
and if multiple fields are required, creates an ORQuery which mults out e.g.
C<foo> into C<(title:foo OR content:foo)>.

-=head1 Extending the query language.
+=head1 Extending the query language

To add support for trailing wildcards to our query language, first we need to
modify our grammar, adding a C<prefix_query> production and tweaking the
@@ -283,7 +283,7 @@
| prefix_query
| term_query

- preix_query:
+ prefix_query:
/(\w+\*)/
{ KinoSearch::Search::LeafQuery->new( text => $1 ) }

@@ -310,12 +310,12 @@
}
}

-=head1 USAGE
+=head1 Usage

Insert any of our custom parsers into the search.cgi sample app to get a feel
for how they behave:

- my $parser = SimpleQueryParser->new( schema => $searcher->get_schema );
+ my $parser = FlatQueryParser->new( schema => $searcher->get_schema );
my $query = $parser->parse( $cgi->param('q') || '' );
my $hits = $searcher->search(
query => $query,

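Reader's note on the tokenizing step: the behavior described for _tokenize()
(quoted material becomes a single token, everything else is split on
whitespace) can be sketched in a few lines of plain Perl. This is a
stand-alone illustration under stated assumptions, not the code from the POD.
tokenize() and classify() are hypothetical names, and classify() uses a simple
split as a stand-in for the PolyAnalyzer.

```perl
use strict;
use warnings;

# Hypothetical stand-alone tokenizer: double-quoted runs stay together,
# everything else is split on whitespace.
sub tokenize {
    my ($query_string) = @_;
    my @tokens;
    while ( length $query_string ) {
        if ( $query_string =~ s/^\s+// ) {
            next;    # skip whitespace between tokens
        }
        elsif ( $query_string =~ s/^("[^"]*(?:"|$))// ) {
            push @tokens, $1;    # quoted material; closing quote is optional
        }
        else {
            $query_string =~ s/^(\S+)//;
            push @tokens, $1;    # bare word
        }
    }
    return \@tokens;
}

# Stand-in for the analysis step: one word suggests a term query,
# several words suggest a phrase query.
sub classify {
    my ($token) = @_;
    ( my $text = lc $token ) =~ s/"//g;
    my @words = grep { length } split /\W+/, $text;
    return @words > 1 ? 'phrase' : 'term';
}

for my $token ( @{ tokenize('foo "bar baz" qux') } ) {
    printf "%-10s %s\n", $token, classify($token);
}
```

Each branch of the loop consumes at least one character, so the loop always
terminates, including on input with an unbalanced quote.
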
Modified: trunk/perl/lib/KinoSearch/Docs/Cookbook.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Cookbook.pod 2008-09-11 01:43:18 UTC (rev 3867)
+++ trunk/perl/lib/KinoSearch/Docs/Cookbook.pod 2008-09-11 03:57:22 UTC (rev 3868)
@@ -4,16 +4,10 @@

=head1 DESCRIPTION

-Each of the recipes in the Cookbook uses the completed
-L<Tutorial|KinoSearch::Docs::Tutorial> application as its point of departure.
-The materials can be found in the C<sample> directory at the root of the
-KinoSearch distribution:
+The Cookbook provides thematic documentation covering some of KinoSearch's
+more sophisticated features. For a step-by-step introduction to KinoSearch,
+see L<KinoSearch::Docs::Tutorial>.

- sample/USConSchema.pm # custom KinoSearch::Schema subclass
- sample/invindexer.plx # indexing app
- sample/search.cgi # search app
- sample/us_constitution # html documents
-
=head2 Chapters

=over
@@ -21,7 +15,7 @@
=item *

L<KinoSearch::Docs::Cookbook::CachedSearcher> - Improve search-time
-performance under FastCGI or mod_perl by reusing a cached Searcher/IndexReader.
+performance under FastCGI or mod_perl by reusing a cached Searcher.

=item *

@@ -31,11 +25,22 @@

=item *

-L<KinoSearch::Docs::Cookbook::CustomQueryParser> - Create a custom search
-query language, using KinoSearch::QueryParser and Parse::RecDescent.
+L<KinoSearch::Docs::Cookbook::CustomQueryParser> - Define your own custom
+search query syntax using KinoSearch::QueryParser and Parse::RecDescent.

=back

+=head2 Materials
+
+Some of the recipes in the Cookbook reference the completed
+L<Tutorial|KinoSearch::Docs::Tutorial> application. These materials can be
+found in the C<sample> directory at the root of the KinoSearch distribution:
+
+ sample/USConSchema.pm # custom KinoSearch::Schema subclass
+ sample/invindexer.pl # indexing app
+ sample/search.cgi # search app
+ sample/us_constitution # html documents
+
=head1 COPYRIGHT

Copyright 2005-2008 Marvin Humphrey
