Here is my code:
#!/usr/bin/perl
use strict;
use warnings;

package Schema;
use base qw( KinoSearch::Schema );
use KinoSearch::Analysis::PolyAnalyzer;

our %fields = ( title => 'KinoSearch::Schema::FieldSpec' );

sub analyzer { KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' ) }

package main;
use File::Find;
use KinoSearch::InvIndexer;

my $index = KinoSearch::InvIndexer->new(invindex => Schema->clobber('index'));
find(\&wanted, "en");
$index->finish();

sub wanted {
    /\.html$/ or return;
    my $filename = $_;
    my %field;
    open my $fh, $filename or die "$filename: $!";
    while (<$fh>) {
        m!<body>! and last;
        if (m!<title>(.*)</title>!) {
            $field{title} = $1;
            last;
        }
    }
    close $fh;
    $index->add_doc(\%field);
}

I'm running this with KinoSearch-0.20_03 from CPAN. Reproducing the
error needs a reasonably big collection of files, something like 50,000.
I used a static HTML dump from Wikipedia. If you want to try that, you
need to install 7zip; on Debian the package name is p7zip-full.
Assuming you use the wiki dump and have saved the code as
index_wiki.pl, the steps to run look like this:
wget http://static.wikipedia.org/downloads/April_2007/en/wikipedia-en-html.0.7z
7z x wikipedia-en-html.0.7z
perl index_wiki.pl
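
Before the last step it may be worth confirming that the extracted dump really is large enough to hit the problem. The following is only a rough sketch: count_html is a made-up helper name, not part of KinoSearch or the dump, and it counts regular files ending in .html under a directory (default "en"), which approximates what the script's wanted() callback will try to index.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative only): count the regular
# files whose names end in .html below a directory. Defaults to "en",
# the directory the indexing script walks.
count_html() {
    find "${1:-en}" -type f -name '*.html' | wc -l
}
```

Calling `count_html en` after the 7z step prints the number of candidate files; by the estimate above, something on the order of 50,000 is needed.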
The output I get from running index_wiki.pl is:
Error in function kino_FSFolder_open_outstream at c_src/KinoSearch/Store/FSFolder.c:56: Can't open '_1.skip': No such file or directory
 at /home/edward/src/KinoSearch-0.20_03/blib/lib/KinoSearch/Index/SegWriter.pm line 121
    KinoSearch::Index::SegWriter::add_doc('KinoSearch::Index::SegWriter=HASH(0x816bdfc)', 'HASH(0x890e790)', 1) called at /home/edward/src/KinoSearch-0.20_03/blib/lib/KinoSearch/InvIndexer.pm line 114
    KinoSearch::InvIndexer::add_doc('KinoSearch::InvIndexer=HASH(0x816b7c0)', 'HASH(0x890e790)') called at ./index_wiki.pl line 42
    main::wanted() called at /usr/share/perl/5.8/File/Find.pm line 886
    File::Find::_find_dir('HASH(0x816c00c)', 'en', 8) called at /usr/share/perl/5.8/File/Find.pm line 700
    File::Find::_find_opt('HASH(0x816c00c)', 'en') called at /usr/share/perl/5.8/File/Find.pm line 1223
    File::Find::find('CODE(0x8337cac)', 'en') called at ./index_wiki.pl line 23

The line numbers in index_wiki.pl are wrong because I took out the
'use lib' line in the sample above.
Let me know if you need any more info.
--
Edward Betts