Mailing List Archive: r3786 - trunk/perl/lib/KinoSearch/Docs/Tutorial

Author: creamyg
Date: 2008-08-28 22:16:16 -0700 (Thu, 28 Aug 2008)
New Revision: 3786

Modified:
trunk/perl/lib/KinoSearch/Docs/Tutorial/Analysis.pod
trunk/perl/lib/KinoSearch/Docs/Tutorial/FieldSpec.pod
Log:
Flesh out the Analysis and FieldSpec chapters in the Tutorial.

Modified: trunk/perl/lib/KinoSearch/Docs/Tutorial/Analysis.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Tutorial/Analysis.pod 2008-08-28 22:27:10 UTC (rev 3785)
+++ trunk/perl/lib/KinoSearch/Docs/Tutorial/Analysis.pod 2008-08-29 05:16:16 UTC (rev 3786)
@@ -4,8 +4,72 @@

=head1 DESCRIPTION

-TODO
+Try swapping out the PolyAnalyzer in USConSchema for a Tokenizer:

+ package USConSchema;
+ use base qw( KinoSearch::Schema );
+
+ use KinoSearch::Analysis::Tokenizer;
+
+ sub analyzer { return KinoSearch::Analysis::Tokenizer->new }
+
+Try searching for C<senate>, C<Senate>, and C<Senator> before and after making
+the change and re-indexing.
+
+Under PolyAnalyzer, the results are identical for all three searches, but
+under Tokenizer, searches are case-sensitive, and the result sets for
+C<Senate> and C<Senator> are distinct.
+
+What's happening is that PolyAnalyzer is performing more aggressive processing
+than Tokenizer. In addition to tokenizing, it's also converting all text to
+lower case so that searches are case-insensitive, and using a "stemming"
+algorithm to reduce related words to a common stem (C<senat>, in this case).
+
+PolyAnalyzer is actually multiple Analyzers wrapped up in a single package.
+In this case, it's three-in-one, since specifying a PolyAnalyzer with
+C<< language => 'en' >> is equivalent to this snippet:
+
+ my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new;
+ my $tokenizer = KinoSearch::Analysis::Tokenizer->new;
+ my $stemmer = KinoSearch::Analysis::Stemmer->new( language => 'en' );
+ my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
+ analyzers => [ $lc_normalizer, $tokenizer, $stemmer ],
+ );
+
+You can add or subtract Analyzers from there if you like. Try adding a fourth
+Analyzer, a Stopalizer for suppressing "stopwords" like C<the>, C<if>,
+C<maybe>, and so on.
+
+ my $stopalizer = KinoSearch::Analysis::Stopalizer->new(
+ language => 'en',
+ );
+ my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
+ analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ],
+ );
+
+Also, try removing the Stemmer.
+
+ my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
+ analyzers => [ $lc_normalizer, $tokenizer ],
+ );
+
+The original choice probably still yields the best results for this document
+collection, but you get the idea: sometimes you want a different Analyzer.
+
+=head2 When the best Analyzer is no Analyzer
+
+Sometimes you don't want an Analyzer at all. For instance, "category" fields
+are often set up to match exactly or not at all, as are fields like
+"last_name" (because you probably don't want to conflate results for
+"Humphrey" and "Humphries").
+
+To specify that there should be no analysis performed at all, use a custom
+FieldSpec:
+
+ package MySchema::NotAnalyzed;
+ use base qw( KinoSearch::FieldSpec::TextField );
+ sub analyzed { 0 }
+
=head1 COPYRIGHT

Copyright 2008 Marvin Humphrey

Modified: trunk/perl/lib/KinoSearch/Docs/Tutorial/FieldSpec.pod
===================================================================
--- trunk/perl/lib/KinoSearch/Docs/Tutorial/FieldSpec.pod 2008-08-28 22:27:10 UTC (rev 3785)
+++ trunk/perl/lib/KinoSearch/Docs/Tutorial/FieldSpec.pod 2008-08-29 05:16:16 UTC (rev 3786)
@@ -5,8 +5,53 @@

=head1 DESCRIPTION

-TODO
+The Schema subclass we used in the last chapter specifies three fields:

+ our %fields = (
+ title => 'text',
+ content => 'text',
+ url => 'text',
+ );
+
+Since they are all defined as "text" fields, they are all searchable --
+including the C<url> field, a dubious choice. Some URLs contain meaningful
+information, but these don't, really:
+
+ http://example.com/us_constitution/amend1.html
+
+We may as well not bother indexing the URL content. To achieve that we need
+to assign the C<url> field to a custom subclass of
+L<KinoSearch::FieldSpec::TextField>.
+
+ package USConSchema::NotIndexed;
+ use base qw( KinoSearch::FieldSpec::TextField );
+ sub indexed { 0 }
+
+ package USConSchema;
+ use base qw( KinoSearch::Schema );
+
+ our %fields = (
+ title => 'text',
+ content => 'text',
+ url => 'USConSchema::NotIndexed',
+ );
+
+To observe the change in behavior, try searching for C<us_constitution> both
+before and after changing the Schema and re-indexing.
+
+=head2 Toggling stored()
+
+For a taste of other FieldSpec possibilities, try turning off stored() for
+one or more fields.
+
+ package USConSchema::NotStored;
+ use base qw( KinoSearch::FieldSpec::TextField );
+ sub stored { 0 }
+
+Turning off stored() for either C<title> or C<url> mangles our results
+page, but since we're not displaying C<content>, turning it off for C<content>
+has no effect -- except on index size.
+
=head1 COPYRIGHT

Copyright 2008 Marvin Humphrey
