Mailing List Archive

Index structure changing during runtime
Here's a question I've been wondering about for some time, and haven't got
around to testing yet.

Let's say you have an existing master index with special fields field_a,
field_b and field_c.

You also have thousands of temp indexes with the same fields. These temp
indexes are periodically absorbed into the master with
$idx->add_invindexes($temp).

Is it possible to change the structure of the temp index by (eg) adding a
special field (or removing), then merging this altered temp index with the
master index which still has the old structure?

If so, what effect would this have on searching?
Index structure changing during runtime
On Aug 20, 2006, at 1:43 AM, henka@cityweb.co.za wrote:

> Let's say you have an existing master index with special fields field_a,
> field_b and field_c.
>
> You also have thousands of temp indexes with the same fields. These temp
> indexes are periodically absorbed into the master with
> $idx->add_invindexes($temp).
>
> Is it possible to change the structure of the temp index by (eg) adding a
> special field (or removing), then merging this altered temp index with the
> master index which still has the old structure?

It's not possible to update documents. Merging of segments is
already quite complex -- document and field numbers have to be
remapped for _each index entry_, deleted docs have to be removed both
from storage and from the index, etc, all the while maintaining
absolute integrity of a sophisticated file format...

Delete and re-add is the only strategy which works. Updating a
document in a KinoSearch index is never going to be like updating a
row in an SQL table -- the emphasis in an inverted indexer is on the
index, not the data. Depending on your Analysis chain, it may
actually be faster to index 100 docs from scratch than it is to merge
in a 100 document sub-index! That would be difficult to measure for
esoteric reasons, but the point is that interleaving two inverted
indexes into one involves a lot of tricky processing.

The way to handle this sort of thing is to keep a separate database
with the original documents. Tie the db to the KS index using a
unique id field, which should be indexed but not analyzed (whether or
not it is stored is irrelevant).

    $invindexer->spec_field(
        name     => 'id',
        analyzed => 0,    # don't e.g. stem the id!
        indexed  => 1,    # this is the default
        stored   => 0,    # doesn't matter
    );

Then later, you can use the id field to remove documents that have
been modified, just before you re-add them.

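    while ( my $modified_doc = $sth->fetchrow_hashref ) {
        # delete_docs_by_term takes a Term object: field name, then value
        my $term = KinoSearch::Index::Term->new( 'id', $modified_doc->{id} );
        $invindexer->delete_docs_by_term($term);
        my $doc = $invindexer->new_doc;
        $doc->set_value( id => $modified_doc->{id} );
        # ...
    }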
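
To make the delete-then-re-add idea concrete without depending on KinoSearch itself, here is a minimal toy model in plain Perl. Everything in it (@slots, %deleted, delete_by_id, add_doc, replace_doc, live_docs) is a hypothetical name invented for illustration, not part of the KinoSearch API. It only sketches the general pattern: an "update" marks every old copy carrying that id as deleted, then appends the new version.

```perl
use strict;
use warnings;

# Hypothetical toy "index": docs live in numbered slots, and deletion only
# marks a slot as dead. This is an illustration, not KinoSearch internals.
my @slots;      # slot number => { id => ..., body => ... }
my %deleted;    # slot number => 1 once marked deleted

# Mark every slot holding a doc with this id as deleted.
sub delete_by_id {
    my ($id) = @_;
    for my $n ( 0 .. $#slots ) {
        $deleted{$n} = 1 if $slots[$n]{id} eq $id;
    }
}

# Append a new document in a fresh slot.
sub add_doc {
    my (%fields) = @_;
    push @slots, {%fields};
}

# "Update" = delete all copies with this id, then add the new version.
sub replace_doc {
    my (%fields) = @_;
    delete_by_id( $fields{id} );
    add_doc(%fields);
}

# Return only the docs that have not been marked deleted.
sub live_docs {
    return map { $slots[$_] } grep { !$deleted{$_} } 0 .. $#slots;
}

add_doc( id => 'doc1', body => 'old text' );
add_doc( id => 'doc2', body => 'other text' );
replace_doc( id => 'doc1', body => 'new text' );

my @live = live_docs();
printf "%s: %s\n", $_->{id}, $_->{body} for @live;
```

The old copy of doc1 still occupies a slot after the replace; it is merely hidden from live_docs, which is why the unique id field matters for finding it.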

Maintaining multiple copies of the data may sound redundant, but
again, the emphasis in KinoSearch is on the index, not the data -- it
should not serve as primary document storage.

Cheers,

Marvin Humphrey

--
I'm looking for a part time job.
Index structure changing during runtime
> It's not possible to update documents. Merging of segments is
> already quite complex -- document and field numbers have to be
> remapped for _each index entry_, deleted docs have to be removed both
> from storage and from the index, etc, all the while maintaining
> absolute integrity of a sophisticated file format...
>
> Delete and re-add is the only strategy which works.

Sorry - I didn't communicate clearly in the original post: I meant
deleting and adding a temp index to the master (not updating).

What is the impact of having a master index with fields(a,b,c), and then
deleting doc(1) with fields(a,b,c) and attempting to add a temp index
(doc(1) again) with fields(a,b,c,d), or fields(a,c,d)?

> Maintaining multiple copies of the data may sound redundant, but
> again, the emphasis in KinoSearch is on the index, not the data -- it
> should not serve as primary document storage.

Ja, this is a given - the idea is to keep the original HTML for "cache"
display anyway.
Index structure changing during runtime
On Aug 21, 2006, at 12:59 AM, henka@cityweb.co.za wrote:

> What is the impact of having a master index with fields(a,b,c), and then
> deleting doc(1) with fields(a,b,c) and attempting to add a temp index
> (doc(1) again) with fields(a,b,c,d), or fields(a,c,d)?

KinoSearch can handle that. Note that field d will not be added as
an empty string or whatever to docs which were indexed earlier
without it.
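
For application code that reads documents back, the practical consequence is that field d may simply be absent on older docs. A minimal sketch in plain Perl, assuming documents come back as hashrefs; the @docs data and the field_or_default helper are hypothetical, made up for illustration, and not part of KinoSearch:

```perl
use strict;
use warnings;

# Hypothetical documents as they might look after merging a newer temp index.
my @docs = (
    # indexed earlier, before field_d existed
    { field_a => 'x', field_b => 'y', field_c => 'z' },
    # indexed later, with field_d
    { field_a => 'x', field_b => 'y', field_c => 'z', field_d => 'new' },
);

# Treat a missing field as absent instead of assuming an empty string.
sub field_or_default {
    my ( $doc, $field, $default ) = @_;
    return exists $doc->{$field} ? $doc->{$field} : $default;
}

for my $doc (@docs) {
    print field_or_default( $doc, 'field_d', '(none)' ), "\n";
}
```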

What I suggest you avoid doing is changing the definition of any
field. KS currently resolves such conflicts according to a weird set
of rules taken from Lucene, but that might not be true in the
future. My intent is to give KinoSearch some characteristics of an
object-oriented database, associating each fieldname with a class
which specs how records should be serialized/deserialized -- both in
the index and in storage. This is key to eliminating the tight
coupling of KS to its file format (and a host of other good things).
Under that system, associating a fieldname with conflicting defs will
produce an error.

Best,

Marvin Humphrey

--
I'm looking for a part time job.
Index structure changing during runtime
>> What is the impact of having a master index with fields(a,b,c), and then
>> deleting doc(1) with fields(a,b,c) and attempting to add a temp index
>> (doc(1) again) with fields(a,b,c,d), or fields(a,c,d)?
>
> KinoSearch can handle that. Note that field d will not be added as
> an empty string or whatever to docs which were indexed earlier
> without it.

Thanks - this helps me get a better understanding of the underlying KS
behaviour: i.e., I can add fields to expand search functionality as time
goes by without breaking the system.