On Aug 20, 2006, at 1:43 AM, henka@cityweb.co.za wrote:
> Let's say you have an existing master index with special fields
> field_a, field_b and field_c.
>
> You also have thousands of temp indexes with the same fields. These
> temp indexes are periodically absorbed into the master with
> $idx->add_invindexes($temp).
>
> Is it possible to change the structure of the temp index by (eg)
> adding a special field (or removing), then merging this altered temp
> index with the master index which still has the old structure?

It's not possible to update documents in place. Merging of segments is
already quite complex -- document and field numbers have to be
remapped for _each index entry_, deleted docs have to be removed both
from storage and from the index, etc., all the while maintaining
absolute integrity of a sophisticated file format...

Delete and re-add is the only strategy which works. Updating a
document in a KinoSearch index is never going to be like updating a
row in an SQL table -- the emphasis in an inverted indexer is on the
index, not the data. Depending on your Analysis chain, it may
actually be faster to index 100 docs from scratch than it is to merge
in a 100 document sub-index! That would be difficult to measure for
esoteric reasons, but the point is that interleaving two inverted
indexes into one involves a lot of tricky processing.
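
To give a feel for the remapping, here is a toy sketch in plain Perl. It is not KinoSearch code and does not reflect its file format: the "segments" are hypothetical in-memory hashes, merge_segments() is a made-up helper, and only document numbers (not field numbers) are remapped.

```perl
use strict;
use warnings;

# Toy "segment": a doc count, a set of deleted doc numbers, and postings
# mapping each term to a sorted list of doc numbers. Illustration only.
sub merge_segments {
    my @segments = @_;
    my %merged;          # term => [ new doc numbers ]
    my $next_doc = 0;    # next doc number to assign in the merged segment

    for my $seg (@segments) {
        # Build an old => new doc number map, skipping deleted docs.
        my %remap;
        for my $old ( 0 .. $seg->{doc_count} - 1 ) {
            next if $seg->{deleted}{$old};
            $remap{$old} = $next_doc++;
        }

        # Every single posting has to be rewritten through the map.
        for my $term ( keys %{ $seg->{postings} } ) {
            for my $old ( @{ $seg->{postings}{$term} } ) {
                next unless exists $remap{$old};    # drop deleted docs
                push @{ $merged{$term} }, $remap{$old};
            }
        }
    }

    return { doc_count => $next_doc, deleted => {}, postings => \%merged };
}

my $master = {
    doc_count => 3,
    deleted   => { 1 => 1 },    # doc 1 was deleted
    postings  => { apple => [ 0, 1 ], pear => [ 1, 2 ] },
};
my $temp = {
    doc_count => 2,
    deleted   => {},
    postings  => { apple => [1], plum => [0] },
};

my $merged = merge_segments( $master, $temp );
for my $term ( sort keys %{ $merged->{postings} } ) {
    print "$term: @{ $merged->{postings}{$term} }\n";
}
```

Even in this stripped-down form, every posting gets touched once per merge, which is why merging is not obviously cheaper than indexing from scratch.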

The way to handle this sort of thing is to keep a separate database
with the original documents. Tie the db to the KS index using a
unique id field, which should be indexed but not analyzed (whether or
not it is stored is irrelevant).

    $invindexer->spec_field(
        name     => 'id',
        analyzed => 0,    # don't e.g. stem the id!
        indexed  => 1,    # this is the default
        stored   => 0,    # doesn't matter
    );
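
As a rough illustration of why the id should not be analyzed, here is a self-contained sketch. toy_analyze() is a hypothetical stand-in for an analysis chain (lowercasing, splitting on non-word characters, crude suffix stripping), not a KinoSearch analyzer, and the id value is made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an analysis chain: lowercase, split on
# non-word characters, and strip a trailing "s" as a crude "stemmer".
sub toy_analyze {
    my ($text) = @_;
    my @tokens = grep { length } split /\W+/, lc $text;
    s/s$// for @tokens;
    return @tokens;
}

my $id = 'Docs-2006/A17';    # made-up id for illustration

my @analyzed   = toy_analyze($id);    # terms an analyzed field would index
my @unanalyzed = ($id);               # the single term an unanalyzed field indexes

print "analyzed:   @analyzed\n";
print "unanalyzed: @unanalyzed\n";
```

With analysis on, the id is broken into several altered tokens, so an exact-match lookup on the original string has nothing to hit. Left unanalyzed, the id is indexed as one term, verbatim.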

Then later, you can use the id field to remove documents that have
been modified, just before you re-add them.
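
    use KinoSearch::Index::Term;

    while ( my $modified_doc = $sth->fetchrow_hashref ) {
        # delete_docs_by_term takes a Term object: field name plus text
        my $term = KinoSearch::Index::Term->new( 'id', $modified_doc->{id} );
        $invindexer->delete_docs_by_term($term);
        my $doc = $invindexer->new_doc;
        $doc->set_value( id => $modified_doc->{id} );
        # ...
    }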
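
The delete-then-re-add cycle can be pictured with a toy model in plain Perl. Everything here is hypothetical (the @docs array, %deleted set, and the three helper subs), and it assumes deletion only marks a document as gone until a later merge actually drops it.

```perl
use strict;
use warnings;

# Toy index: an append-only list of docs plus a set of deleted doc numbers.
my @docs;       # doc number => { id => ..., body => ... }
my %deleted;    # doc number => 1

# Mark every doc carrying this id as deleted (nothing is removed yet).
sub delete_by_id {
    my ($id) = @_;
    for my $n ( 0 .. $#docs ) {
        $deleted{$n} = 1 if $docs[$n]{id} eq $id;
    }
}

# Append a new doc; it always gets a fresh doc number.
sub add_doc {
    my %doc = @_;
    push @docs, \%doc;
}

# Docs a search would still see.
sub live_docs {
    return map { $docs[$_] } grep { !$deleted{$_} } 0 .. $#docs;
}

add_doc( id => 'a', body => 'first draft' );
add_doc( id => 'b', body => 'untouched' );

# "Update" doc a: delete the old copy, then add the new one.
delete_by_id('a');
add_doc( id => 'a', body => 'second draft' );

my @live = live_docs();
print "$_->{id}: $_->{body}\n" for @live;
```

The old copy of doc a is still sitting in storage after the "update"; it is only hidden from results, and cleaning it out is part of the work a later merge has to do.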

Maintaining multiple copies of the data may sound redundant, but
again, the emphasis in KinoSearch is on the index, not the data -- it
should not serve as primary document storage.

Cheers,
Marvin Humphrey
--
I'm looking for a part time job.