Mailing List Archive

Optimize on finish is affecting search results
Just a curiosity...

It's my understanding that passing "optimize => 1" to finish() after
making a lot of changes will result in an index that's optimized for
speed. However, in addition to that, I'm finding that it's having an
effect on search results as well, albeit a positive one. My problem
is that some queries only work on an optimized index.

Consider a document with the following content:

"salad.robot mercenary"

Just random words that won't be gobbled up by the stop list. Consider
also that the tokenizing expression just looks for words. The content
would be split like: "salad|robot|mercenary".
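
As a rough illustration of that split, here is a minimal Python sketch. It is not KinoSearch's tokenizer: the `\w+` pattern and the `tokenize` name are assumptions standing in for a tokenizing expression that "just looks for words".

```python
import re

# Assumed stand-in for a tokenizing expression that only looks for words.
WORD_RE = re.compile(r"\w+")

def tokenize(text):
    """Return the word tokens in text, in order; punctuation acts as a barrier."""
    return WORD_RE.findall(text)

print("|".join(tokenize("salad.robot mercenary")))  # prints salad|robot|mercenary
```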

After adding this document to my index for the first time, I can find
it with any of the following queries:

salad
robot
mercenary
salad.robot
"robot mercenary"
"salad.robot mercenary"

If I then re-index the document without making any changes to the
content, essentially just remove it and add it, and then call the
non-optimizing finish(), all of the above queries continue to work
except for "salad.robot". That query does work if I optimize the
index after re-adding the document, however.

Perhaps I don't fully understand what KinoSearch is doing with that
query, but I suspect "salad.robot" is equivalent to asking for
"token salad followed by token robot". Indeed, I should be able to
replace the period with any other token barrier. For example, this
should work equally well:

salad!?!!!robot

and indeed it does, but only after optimizing the index.
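
To make the "token salad followed by token robot" reading concrete, here is a small Python sketch of that interpretation. It is only a model of my assumption, not how KinoSearch actually parses or scores the query; `tokenize` and `phrase_match` are hypothetical names.

```python
import re

def tokenize(text):
    """Split text into word tokens; anything that is not a word character is a barrier."""
    return re.findall(r"\w+", text)

def phrase_match(query, document):
    """True if the query's tokens occur as a consecutive run in the document's tokens."""
    q = tokenize(query)
    d = tokenize(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

doc = "salad.robot mercenary"
print(phrase_match("salad.robot", doc))      # True
print(phrase_match("salad!?!!!robot", doc))  # True
print(phrase_match("salad mercenary", doc))  # False: the two tokens are not adjacent
```

Under this reading the barrier characters never reach the index, so both queries reduce to the same two-token phrase.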

Granted, this may seem like an odd sort of search to perform. If the
period was important to me, I could change the tokenizer so that it
includes it in the list of characters to keep, and I may end up doing
that anyways. I guess I'm just curious to know why that query only
works after using optimize.

I should point out that I'm using KinoSearch 0.15.
Optimize on finish is affecting search results
On Aug 1, 2007, at 11:34 AM, Matt wrote:

> Consider a document with the following content:
>
> "salad.robot mercenary"
>
> Just random words that won't be gobbled up by the stop list. Consider
> also that the tokenizing expression just looks for words. The content
> would be split like: "salad|robot|mercenary".

Yes, that's right. Tokenizing is almost certainly unrelated to the
issue you describe.

> salad
> robot
> mercenary
> salad.robot
> "robot mercenary"
> "salad.robot mercenary"
>
> If I then re-index the document without making any changes to the
> content, essentially just remove it and add it, and then call the
> non-optimizing finish(), all of the above queries continue to work
> except for "salad.robot".

For the record, there is a subtle difference between the way
QueryParser parses 'salad.robot' and the way it parses 'salad
robot'. The first will be interpreted as a phrase.

However, that should not impact the search results pre- and post-
optimize.

> That query does work if I optimize the
> index after re-adding the document, however.

When you delete-by-term, what KS does is mark any documents which
match the term in old segments as "deleted". When you re-add, the
new document ends up in a new segment.

The re-added document ought to be available from the new segment.

When you optimize, KS merges all existing segments into a single new
segment, and documents may be reordered. Search results from the
same index pre- and post-optimize should be identical except for the
order of documents which have identical scores against the search query.
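
The expected behavior can be sketched with a toy model in Python. This is a deliberate simplification and not KS's actual code: `ToyIndex`, `Segment`, and `delete_by_id` are hypothetical names, deletion is keyed on a document id rather than a term, and scoring is ignored.

```python
import re

class Segment:
    def __init__(self, docs):
        self.docs = list(docs)   # list of (doc_id, text) pairs
        self.deleted = set()     # positions in self.docs marked as deleted

class ToyIndex:
    def __init__(self):
        self.segments = []       # finished segments
        self.pending = []        # docs added since the last finish()

    def add_doc(self, doc_id, text):
        self.pending.append((doc_id, text))

    def delete_by_id(self, doc_id):
        # Mark matching docs in existing segments as deleted; nothing is rewritten.
        for seg in self.segments:
            for pos, (d, _) in enumerate(seg.docs):
                if d == doc_id:
                    seg.deleted.add(pos)

    def finish(self, optimize=False):
        # Newly added docs always land in a new segment.
        if self.pending:
            self.segments.append(Segment(self.pending))
            self.pending = []
        if optimize:
            # Merge every live doc into a single fresh segment.
            live = [doc for seg in self.segments
                    for pos, doc in enumerate(seg.docs)
                    if pos not in seg.deleted]
            self.segments = [Segment(live)]

    def search(self, word):
        # Return ids of live docs containing the word, across all segments.
        return {doc_id
                for seg in self.segments
                for pos, (doc_id, text) in enumerate(seg.docs)
                if pos not in seg.deleted and word in re.findall(r"\w+", text)}

idx = ToyIndex()
idx.add_doc("a", "salad.robot mercenary")
idx.add_doc("b", "robot")
idx.finish()

idx.delete_by_id("a")                      # old copy is only marked deleted
idx.add_doc("a", "salad.robot mercenary")  # re-added copy goes to a new segment
idx.finish()
before = idx.search("salad")

idx.finish(optimize=True)                  # merge down to one segment
after = idx.search("salad")
print(before == after)                     # True in this model
```

In this model the set of matching documents is the same before and after the merge, which is the behavior a correct index should show.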

> I guess I'm just curious to know why that query only
> works after using optimize.

The possibility exists that one of KinoSearch's iterators is messing
up and quitting before the last document. Then, when the segments are
merged, the document appears in a new place and KS can find it
again. If that's true, it's a bug.

There may also be some concurrency issues depending on how your
indexing/search apps are set up.

> I should point out that I'm using KinoSearch 0.15.

If we can reduce this to a problem case that I can duplicate locally,
I will try to fix it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Optimize on finish is affecting search results
>
> If we can reduce this to a problem case that I can duplicate locally,
> I will try to fix it.
>

I've attached a test script that should reproduce the behavior. It's
not exactly trivial, but you may be able to distil it further. Please
note that it will create an index in /tmp the first time you run it.

I recommend running the script multiple times as the test results are
initially inconsistent (for me, at least). It's entirely possible
that I'm just doing something weird or unconventional, though. :)

Thanks, I really appreciate your input on this.


Matt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_search_sans_optimize.pl
Type: application/x-perl
Size: 2534 bytes
Desc: not available
Url : http://www.rectangular.com/pipermail/kinosearch/attachments/20070802/3aac9218/test_search_sans_optimize.bin