Mailing List Archive

Possible Phrase Query Bug
I think I've found a bug with phrase queries. (The RT queue did not
look too active so I am mailing it to the list as well).

Here's the reproduction strategy:

1) Create a PolyAnalyzer
2) Create an indexer w/ the above analyzer
3) Add a document w/ a text field of 'zzz xxx yyy zzz'
4) finish() the index
5) create a searcher w/ the above analyzer
6) Do the following phrase search: "yyy zzz"
7) EXPECTED: 1 result. GOT: 0 results

I have attached a script which demonstrates this issue with
KinoSearch 0.15. I think the first "zzz" in the body of the text
field is tripping up the phrase scoring, but that's just a guess.

Is this behavior expected? If not, is it a known bug? I looked in
RT and searched the mailing lists but nothing obviously related came
back.

Here is my environment:

* OS: Ubuntu Dapper 6.01 GNU/Linux
* CPU: 2.0Ghz quad-core Intel Xeon (64 bit)
* Kernel: 2.6.15-28 (Debian package 2.6.15-28-amd64-xeon)
* Index Filesystem: ext3 (part of a logical volume in LVM)
* /tmp Filesystem: ext3
* KinoSearch: 0.15 (unpatched, from CPAN)
* Perl: 5.8.7 (stock Perl on Ubuntu Dapper 6.01)
* gcc: 4.03
* ld: 2.16.91

-matthew
Re: Possible Phrase Query Bug [ In reply to ]
Thanks for the detailed bug report Matthew. I've been playing with
phrase scoring recently, so I'll take a look at this. I don't have a
quad-core Xeon to try it out on :) but I'll see if I can reproduce it
on what I've got and report back tonight.

Nathan Kurz
nate@verse.com

On 9/7/07, Matthew O'Connor <matthew.oconnor@socialtext.com> wrote:
> I think I've found a bug with phrase queries. (The RT queue did not
> look too active so I am mailing it to the list as well).

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Possible Phrase Query Bug [ In reply to ]
On 9/7/07, Nathan Kurz <nate@verse.com> wrote:
> I'll see if I can reproduce it on what I've got and report back tonight.
>
> On 9/7/07, Matthew O'Connor <matthew.oconnor@socialtext.com> wrote:
> > I think I've found a bug with phrase queries. (The RT queue did not
> > look too active so I am mailing it to the list as well).

It's a real bug, and also exists in current svn. I've think I've
found the problem, but I'm a little to sleepy to fix it correctly
tonight. I'll try to post a patch tomorrow.

Nathan Kurz
nate@verse.com

ps. Is there a good way to convince the Build script to temporarily
avoid optimizing so GDB is easier to use? I can add things via
buildlib/Lucy/Build.pm's $EXTRA_CCFLAGS, but I'm not sure how to
override the options I compiled Perl with. Do I have to temporarily
edit my Config.pm or is there an easier way?

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Possible Phrase Query Bug [ In reply to ]
On 9/7/07, Matthew O'Connor <matthew.oconnor@socialtext.com> wrote:
> I think I've found a bug with phrase queries.

Attached is a patch against SVN that should solve this. The problem
was an unsigned integer underflow if the first occurrence of a term
was at a position less than its phrase offset.

I've fixed it by changing to using from a u32_t to an i32_t and
casting the compare, which seemed like the least invasive change. It
does slightly reduce the range of legal positions to 31-bits, but I
don't think this is going to catch anyone. I didn't look at the .15
version, but presume the problem can be fixed in the same simple
manner.

There are probably more invasive alternatives if the range reduction
is unacceptable. Also, it looks possible to speed up the loop a bit,
but probably at the cost of readability. I'll hold off on trying this
unless it's thought to be a bottleneck.

Nathan Kurz
nate@verse.com
Re: Possible Phrase Query Bug [ In reply to ]
On Sep 8, 2007, at 11:28 AM, Nathan Kurz wrote:

> On 9/7/07, Matthew O'Connor <matthew.oconnor@socialtext.com> wrote:
>> I think I've found a bug with phrase queries.
>
> Attached is a patch against SVN that should solve this. The problem
> was an unsigned integer underflow if the first occurrence of a term
> was at a position less than its phrase offset.

Thanks Nathan!

FYI, your patch did not apply cleanly against SVN trunk. However,
since the patch was small it was obvious how to apply it by hand.
Once I did so all the tests passed.

I then backported the change (via cargo culting) to KS 0.15 (using
tag 'release-0.15' as a reference). The tests pass there too and my
demonstration script no longer exhibits the problem. I have attached
this patch against 0.15.

Would it be possible to get another release of the KinoSearch 0.15-
line with this patch included? The current trunk seems wildly
different and I'd be nervous about upgrading to it right away.

-matthew
Re: Possible Phrase Query Bug [ In reply to ]
On Sep 8, 2007, at 11:28 AM, Nathan Kurz wrote:

> On 9/7/07, Matthew O'Connor <matthew.oconnor@socialtext.com> wrote:
>> I think I've found a bug with phrase queries.

Matthew, excellent job producing a repeatable test scenario.

> The problem
> was an unsigned integer underflow if the first occurrence of a term
> was at a position less than its phrase offset.

Nathan, excellent job identifying and fixing the bug.

> I've fixed it by changing to using from a u32_t to an i32_t and
> casting the compare, which seemed like the least invasive change.

I appreciate the light touch. :)

What I actually committed was slightly different, and attempts to
explain the algorithm more effectively:

/* Discard positions that occur too early in the field
to match as
* a part of the phrase. For example, if the field
begins with
* "The ants go marching one by one", that initial "The"
cannot
* match as the second term in a phrase search for
* "fight the power".
*/
target = phrase_offset;
while (candidates < candidates_end && *candidates <
target) {
candidates++;
}
if (candidates == candidates_end)
break;

> It does slightly reduce the range of legal positions to 31-bits, but I
> don't think this is going to catch anyone.

It's already limited to that by code elsewhere, and such a situation
is pretty esoteric in any case. With ordinary analyzers, you'd need
to have a field value that was longer than 2 GB for that to come into
play at all, and even then that assumes each byte is a token holding
down a position.

Where people might theoretically have gotten tripped up is via
accidentally running out of room after $token->set_pos_inc
($large_number) during indexing. I don't think limiting the range of
usable positions to 31 bits impedes usability, but *checking* the
range, as this bug illustrates, is a good idea. I've added one such
check with revision 2526.

> I didn't look at the .15
> version, but presume the problem can be fixed in the same simple
> manner.

Yes. The algorithm hadn't changed.

I committed the changes to trunk as 2524 and to maint as 2525. Thanks!

> There are probably more invasive alternatives if the range reduction
> is unacceptable. Also, it looks possible to speed up the loop a bit,
> but probably at the cost of readability. I'll hold off on trying this
> unless it's thought to be a bottleneck.

I'd be surprised if it were. This code is already significantly
different from the Lucene PhraseScorer code and theoretically faster.

In Lucene, each position is iterated over with a method call to
termPositions.nextPosition(), and objects representing each term in
the phrase are kept in a priority queue. In KS, we load all the
positions at once, then traverse the arrays with pointer math, using
an "anchor set" algorithm not in Lucene.

The Lucene technique uses less memory. The KS technique has less
function-call overhead. I'm not certain whether that was the right
trade to make historically, but I am nearly certain that there's no
way to do what we want to do with expanded position scoring without
keeping all positions loaded.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Possible Phrase Query Bug [ In reply to ]
> What I actually committed was slightly different, and attempts to
> explain the algorithm more effectively:

I checked out the current version, and what you did makes sense. You
might consider moving the new check to the outer loop, though, since
you don't have to check the initial conditions for every potential
anchor. But apart from a couple extra compares, your version seems
fine.

> function-call overhead. I'm not certain whether that was the right
> trade to make historically, but I am nearly certain that there's no
> way to do what we want to do with expanded position scoring without
> keeping all positions loaded.

I think you are right, and your version makes much more sense to me
than the Lucene version you describe. One needs full access to the
positions. My instinct involves pushing things even farther in the
direction you took.

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Possible Phrase Query Bug [ In reply to ]
On Sep 8, 2007, at 12:47 PM, Matthew O'Connor wrote:

> Would it be possible to get another release of the KinoSearch 0.15-
> line with this patch included? The current trunk seems wildly
> different and I'd be nervous about upgrading to it right away.

Yes, I'm working on that.

Right now I'm trying to fix KS to work with bleadperl, since 5.10 is
near. KS itself is alright, but reveals a puzzling, critical bug in
Clone <http://rt.cpan.org//Ticket/Display.html?id=29270>. It took me
a while to track it down yesterday.

To fix it, I'm going to do one of my favorite things: eliminate a non-
core dependency. :) The only ones left will be...

* Compress::Zlib, which went into core as of 5.9.3.
* HTML::Parser, which many, many CPAN distros depend on.
* Lingua::StopWords, which I maintain.
* Lingua::Stem::Snowball, which I maintain.

Once I have that and a couple other items taken care of, including
your QueryParser patch, I'll release 0.16.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Possible Phrase Query Bug [ In reply to ]
On Sep 10, 2007, at 12:13 PM, Marvin Humphrey wrote:

> Once I have that and a couple other items taken care of, including
> your QueryParser patch, I'll release 0.16.

Thank you Marvin! It has been a real joy to work with you on
KinoSearch stuff. You're reasonably responsive which is very helpful :)

(Thank you again Nathan, for fixing that PhraseScorer bug!)

-matthew

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Possible Phrase Query Bug [ In reply to ]
On 9/9/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> What I actually committed was slightly different, and attempts to
> explain the algorithm more effectively:

I looked at your implementation more closely, and had a few concerns
that I couldn't figure out the answers for.

1) What happens when phrase_offsets[0] is greater than the first
occurrence of the anchor_set? It seems like there is going to be
another underflow problem, although it doesn't seem to cause problems
when I test for it.

2) I think we continue going through the outer loop even if we have
run out of anchors. Again, this doesn't seem to cause problems, but
seems suboptimal.

I went ahead and wrote what I think is a more efficient implementation
of the anchor searching. It may not be a good patch to apply, but
it's the closest thing I've had to an incremental improvement in a
while, so I thought I should send it along. I'll attach it to future
message after a little more cleanup.

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Possible Phrase Query Bug [ In reply to ]
On Sep 10, 2007, at 5:01 PM, Nathan Kurz wrote:

> 1) What happens when phrase_offsets[0] is greater than the first
> occurrence of the anchor_set?

I'm not sure what you mean by occurrence. Do you mean the first
position in the anchor set?

> It seems like there is going to be
> another underflow problem, although it doesn't seem to cause problems
> when I test for it.

Can you please send your test code so I can see what you're concerned
about?

> 2) I think we continue going through the outer loop even if we have
> run out of anchors.

You're right. We could add a break after setting anchor_set->len.

/* winnow down the size of the anchor set */
anchor_set->len = (char*)new_anchors - (char*)anchors_start;
+
+ /* Bail if we've exhausted all positions for the rarest term. */
+ if (anchor_set->len == 0)
+ break;
}

> Again, this doesn't seem to cause problems, but seems suboptimal.

I doubt it has the any impact on performance, but I like it because
it's the kind of thing a human would do -- stop when it becomes clear
that no match will be found.

> it's the closest thing I've had to an incremental improvement in a
> while,

I implemented your REFCOUNT_INC idea. :)

http://www.rectangular.com/pipermail/kinosearch-commits/2007-
September/000344.html

That one was a nice incremental change.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Possible Phrase Query Bug [ In reply to ]
On 9/10/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Sep 10, 2007, at 5:01 PM, Nathan Kurz wrote:
>
> > 1) What happens when phrase_offsets[0] is greater than the first
> > occurrence of the anchor_set?
>
> I'm not sure what you mean by occurrence. Do you mean the first
> position in the anchor set?
>
> > It seems like there is going to be
> > another underflow problem, although it doesn't seem to cause problems
> > when I test for it.
>
> Can you please send your test code so I can see what you're concerned
> about?

Sorry for my lack of clarity.

/* create an anchor set */
first_posting = (ScorePosting*)PList_Get_Posting(plists[0]);
BB_Copy_Str(anchor_set, (char*)first_posting->prox,
first_posting->freq * sizeof(u32_t));
anchors_start = (u32_t*)anchor_set->ptr;
anchors = anchors_start;
anchors_end = (u32_t*)BBEND(anchor_set);
while(anchors < anchors_end) {
ASSERT(*anchors > phrase_offset, "anchor underflow");
*anchors++ -= phrase_offset;
}

This ASSERT() will fail on certain phrases. An underflow occurs if
searching the phrase offset for the initial term 'a' is greater than
zero on the document "a b c" since the first occurrence of the term
'a' is at position 0. So far as I can tell, we still always get
correct results when we fall through the loop, but I'm worried that
this is good luck rather than reliable design.

> I implemented your REFCOUNT_INC idea. :)
>
> http://www.rectangular.com/pipermail/kinosearch-commits/2007-
> September/000344.html
>
> That one was a nice incremental change.

I hadn't noticed that you did that. Thanks!

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Possible Phrase Query Bug [ In reply to ]
On Sep 10, 2007, at 5:50 PM, Nathan Kurz wrote:

> /* create an anchor set */
> first_posting = (ScorePosting*)PList_Get_Posting(plists[0]);
> BB_Copy_Str(anchor_set, (char*)first_posting->prox,
> first_posting->freq * sizeof(u32_t));
> anchors_start = (u32_t*)anchor_set->ptr;
> anchors = anchors_start;
> anchors_end = (u32_t*)BBEND(anchor_set);
> while(anchors < anchors_end) {
> ASSERT(*anchors > phrase_offset, "anchor underflow");
> *anchors++ -= phrase_offset;
> }
>
> This ASSERT() will fail on certain phrases. An underflow occurs if
> searching the phrase offset for the initial term 'a' is greater than
> zero on the document "a b c" since the first occurrence of the term
> 'a' is at position 0. So far as I can tell, we still always get
> correct results when we fall through the loop, but I'm worried that
> this is good luck rather than reliable design.

Yeah, that's not good. You could end up with an anchor set like this...

((2**32 - 3), 62, 190)

Much code assumes that positions are ordered, and that breaks the
assumption.

We want code that will "shift" invalid values off the front of the
anchor set.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch