TermVectorOffsetStrategy producing Passages with matches out of order? (causing IndexOutOfBoundsException)

Jun 29, 2023, 2:55 PM

Post #1 of 3 (181 views)

I've got a user getting java.lang.IndexOutOfBoundsException from the
UnifiedHighlighter in Solr 9.1.0 w/Lucene 9.3.0

(And FWIW, this same data, w/same configs, in 8.11.1, purportedly didn't
have this problem)

I don't really understand the highlighter code very well, but AFAICT:

- DefaultPassageFormatter seems to assume that the "matches"
  inside a single Passage will be "in order" (by offset)
  - it accounts for the possibility that they overlap
  - but not that matchEnds[i+1] < matchStarts[i]
- but in some cases (i don't understand)
  - TermVectorOffsetStrategy can produce Passages that are "reversed"
    - apparently based on the iteration order from
      OfMatchesIteratorWithSubs ?

Which means DefaultPassageFormatter can trigger IOOBE in StringBuilder..

java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
at java.lang.AbstractStringBuilder.checkRange(Unknown Source) ~[?:?]
at java.lang.AbstractStringBuilder.append(Unknown Source) ~[?:?]
at java.lang.StringBuilder.append(Unknown Source) ~[?:?]
at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:133) ~[?:?]
at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:84) ~[?:?]
at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:25) ~[?:?]
at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightFieldForDoc(FieldHighlighter.java:94) ~[?:?]
at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:954) ~[?:?]
at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:824) ~[?:?]
at org.apache.solr.highlight.UnifiedSolrHighlighter.doHighlighting(UnifiedSolrHighlighter.java:165) ~[?:?]

...as it tries to append a subsequence based on the start+end of
"overlapping" matches that don't actually overlap -- the end of the
"i+1" match is just strictly less than the "start" of the "i"
match because of how the Passage was built

I'm still trying to wrap my head around all the moving pieces to
try and reproduce this in a small scale lucene test, but in the meantime I
patched some of the 9.3.0 highlighter code (patch below sig) to include
some debugging output to kind of show what's happening here...

http://localhost:8983/solr/workplace/select?fl=Expertise,id&defType=lucene&df=Expertise&q=machine+learning&hl=true&rows=1&q.op=OR&echoParams=all

nocommit: highlightOffsetsEnums -> OfMatchesIteratorWithSubs(term:learning,[8-16])
nocommit: Passage2030658055.addMatch(8,16,[6c 65 61 72 6e 69 6e 67],1)
nocommit: highlightOffsetsEnums -> OfMatchesIteratorWithSubs(term:machine,[0-7])
nocommit: Passage2030658055.addMatch(0,7,[6d 61 63 68 69 6e 65],1)
nocommit: format([[Passage[0-16]{learning[8-16],machine[0-7]}score=2.7656934]],Machine Learning) <-- class org.apache.lucene.search.uhighlight.TermVectorOffsetStrategy
nocommit: append(,Machine Learning,0,8)
nocommit: append(Machine ,Machine Learning,8,7)
2023-06-29 21:11:15.711 ERROR (qtp1528769018-17) [ x:workplace] o.a.s.h.RequestHandlerBase java.lang.IndexOutOfBoundsException: start 8, end 7, length 16 => java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
at java.base/java.lang.AbstractStringBuilder.checkRange(AbstractStringBuilder.java:1716)
java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
at java.lang.AbstractStringBuilder.checkRange(AbstractStringBuilder.java:1716) ~[?:?]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:631) ~[?:?]
at java.lang.StringBuilder.append(StringBuilder.java:217) ~[?:?]
at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:134) ~[?:?]

..note how the OfMatchesIteratorWithSubs (OffsetEnum) enumerates over the
two terms in this order...

term:learning,[8-16]
term:machine,[0-7]

...and that order is preserved in the final Passage -- leading
DefaultPassageFormatter.format() to decide that the two matches in this
Passage overlap (because the start of match#1 (machine[0-7]) is less than
the end of match#0 (learning[8-16])) ... but they don't overlap, one is
strictly before the other, so it winds up passing StringBuilder.append an
end < start.
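
To make the failure mode concrete, here is a minimal standalone sketch of
the look-ahead logic as described above. It is a simplified stand-in written
for illustration, not a copy of DefaultPassageFormatter: the class name
PassageFormatSketch, the hard-coded <b> tags, and the parallel int[] arrays
are all assumptions. It only shows that an "overlap" check of the form
matchStarts[i+1] < end is safe when matches are sorted by start offset and
breaks when they are not.

```java
/** Simplified stand-in for the match-merging loop described above; NOT Lucene's code. */
class PassageFormatSketch {

  /** Wraps each (possibly merged) match in <b>..</b>, assuming matches are sorted by start. */
  static String format(String content, int[] matchStarts, int[] matchEnds) {
    StringBuilder sb = new StringBuilder();
    int pos = 0;
    for (int i = 0; i < matchStarts.length; i++) {
      int start = matchStarts[i];
      sb.append(content, pos, start); // text between the previous match and this one
      int end = matchEnds[i];
      // "Overlap" look-ahead: only meaningful if the next match starts at or after this one.
      while (i + 1 < matchStarts.length && matchStarts[i + 1] < end) {
        end = matchEnds[++i];
      }
      sb.append("<b>");
      sb.append(content, start, end); // throws IndexOutOfBoundsException if end < start
      sb.append("</b>");
      pos = end;
    }
    sb.append(content, pos, content.length()); // trailing text after the last match
    return sb.toString();
  }

  public static void main(String[] args) {
    String content = "Machine Learning";
    // Matches sorted by start offset: machine[0-7], learning[8-16].
    System.out.println(format(content, new int[] {0, 8}, new int[] {7, 16}));
    // Order seen in the debug output: learning[8-16], machine[0-7].
    try {
      format(content, new int[] {8, 0}, new int[] {16, 7});
    } catch (IndexOutOfBoundsException e) {
      System.out.println("IOOBE: " + e.getMessage());
    }
  }
}
```

With the reversed order, the look-ahead treats machine[0-7] as overlapping
learning[8-16] and replaces end=16 with end=7, so the final append is asked
for the range (8,7), which matches the append(...,8,7) call in the debug
output above.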

* Has anyone seen any failures like this ?
* Is this a bug in DefaultPassageFormatter's assumptions,
or in the ordering produced by the OffsetEnum ?
* Does anyone have a theory where/how the problem might have changed
between 8.11 and 9.3 ?

-Hoss
http://www.lucidworks.com/

diff --git a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
index 345e2b61316..c82362b5eac 100644
--- a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
+++ b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
@@ -102,6 +102,7 @@ public class DefaultPassageFormatter extends PassageFormatter {
* @param end index of the character following the last character in content
*/
protected void append(StringBuilder dest, String content, int start, int end) {
+ System.err.println("nocommit: append("+dest+","+content+","+start+","+end+")");
if (escape) {
// note: these are the rules from owasp.org
for (int i = start; i < end; i++) {
diff --git a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
index aacb9089e91..eba4e2a6082 100644
--- a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
+++ b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
@@ -91,6 +91,7 @@ public class FieldHighlighter {
}

if (passages.length > 0) {
+ System.err.println("nocommit: format(["+java.util.Arrays.toString(passages)+"],"+content+") <-- "+ fieldOffsetStrategy.getClass());
return passageFormatter.format(passages, content);
} else {
return null;
@@ -152,6 +153,8 @@ public class FieldHighlighter {
int lastPassageEnd = 0;

do {
+ System.err.println("nocommit: highlightOffsetsEnums -> " + off.toString());
+
int start = off.startOffset();
if (start == -1) {
throw new IllegalArgumentException(
diff --git a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java
index 6fa281bb16c..09cd89dc14b 100644
--- a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java
+++ b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java
@@ -41,6 +41,8 @@ public class Passage {

/** @lucene.internal */
public void addMatch(int startOffset, int endOffset, BytesRef term, int termFreqInDoc) {
+ System.err.println("nocommit: Passage"+System.identityHashCode(this)+".addMatch("+startOffset+","+endOffset+","+term+","+termFreqInDoc+")");
+
assert startOffset >= this.startOffset && startOffset <= this.endOffset;
if (numMatches == matchStarts.length) {
int newLength = ArrayUtil.oversize(numMatches + 1, RamUsageEstimator.NUM_BYTES_OBJECT_REF);

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: TermVectorOffsetStrategy producing Passages with matches out of order? (causing IndexOutOfBoundsException)

hossman_lucene at fucit

Jun 29, 2023, 3:49 PM

Post #2 of 3 (181 views)

With some trial and error I realized two things...

1) the order of the terms in the BooleanQuery seems to matter
- but in terms of their "natural order", not the order in the doc

(which is why i was so confused trying to reproduce it)

2) the problem happens when using termVectors but *NOT* using
termVectorPositions

Test patch below demonstrates problem (applies to branch_9x)

-Hoss
http://www.lucidworks.com/

diff --git a/lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/TestUnifiedHighlighterTermVec.java b/lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/TestUnifiedHighlighterTermVec.java
index 341318739f1..b94d60c3f85 100644
--- a/lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/TestUnifiedHighlighterTermVec.java
+++ b/lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/TestUnifiedHighlighterTermVec.java
@@ -76,6 +76,51 @@ public class TestUnifiedHighlighterTermVec extends LuceneTestCase {
dir.close();
}

+ public void testTermVecButNoPositions1() throws Exception {
+ testTermVecButNoPositions("x", "y", "y x", "y x");
+ }
+ public void testTermVecButNoPositions2() throws Exception {
+ testTermVecButNoPositions("y", "x", "y x", "y x");
+ }
+ public void testTermVecButNoPositions3() throws Exception {
+ testTermVecButNoPositions("zzz", "yyy", "zzz yyy", "zzz yyy");
+ }
+ public void testTermVecButNoPositions4() throws Exception {
+ testTermVecButNoPositions("zzz", "yyy", "yyy zzz", "yyy zzz");
+ }
+ public void testTermVecButNoPositions(String aaa, String bbb,
+ String indexed, String expected) throws Exception {
+
+ final FieldType tvNoPosType = new FieldType(org.apache.lucene.document.TextField.TYPE_STORED);
+ tvNoPosType.setStoreTermVectors(true);
+ // tvNoPosType.setStoreTermVectorPositions(true); // cause of problem seems to be lack of positions
+ tvNoPosType.setStoreTermVectorOffsets(true);
+ tvNoPosType.freeze();
+
+ RandomIndexWriter iw = new RandomIndexWriter(random(), dir, indexAnalyzer);
+
+ Field body = new Field("body", indexed, tvNoPosType);
+ Document document = new Document();
+ document.add(body);
+ iw.addDocument(document);
+ try (IndexReader ir = iw.getReader()) {
+ iw.close();
+ IndexSearcher searcher = newSearcher(ir);
+ BooleanQuery query =
+ new BooleanQuery.Builder()
+ // WTF? order of the terms in the boolean query also matters?
+ .add(new TermQuery(new Term("body", aaa)), BooleanClause.Occur.MUST)
+ .add(new TermQuery(new Term("body", bbb)), BooleanClause.Occur.MUST)
+ .build();
+ TopDocs topDocs = searcher.search(query, 10);
+ assertEquals(1, topDocs.totalHits.value);
+ UnifiedHighlighter highlighter = UnifiedHighlighter.builder(searcher, indexAnalyzer).build();
+ String[] snippets = highlighter.highlight("body", query, topDocs, 2);
+ assertEquals(1, snippets.length);
+ assertTrue(snippets[0], snippets[0].contains(expected));
+ }
+ }
+
public void testFetchTermVecsOncePerDoc() throws IOException {
RandomIndexWriter iw = new RandomIndexWriter(random(), dir, indexAnalyzer);


Re: TermVectorOffsetStrategy producing Passages with matches out of order? (causing IndexOutOfBoundsException)

hossman_lucene at fucit

Jul 4, 2023, 6:51 PM

Post #3 of 3 (175 views)

I hacked up the test a bit so it would compile against 9.0 and confirmed
the problem existed there as well.

So going back a little farther with some manual bisection (to account for
the transition from ant to gradle) led me to the following...

# first bad commit: [2719cf6630eb2bd7cb37d0e8462dc912d8fafd83]
LUCENE-9431: UnifiedHighlighter WEIGHT_MATCHES is now true by default
(#362)

...my impression here is that this probably must have existed for a
while somewhere in a 'WEIGHT_MATCHES' code path, and this commit just
exposed the problem "by default".

That impression seemed to be confirmed by tweaking my test patch (against
2719cf6630eb2bd7cb37d0e8462dc912d8fafd83) to use...

    UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher, indexAnalyzer) {
        @Override
        protected Set<HighlightFlag> getFlags(String field) {
          final Set<HighlightFlag> x = new java.util.HashSet<>(super.getFlags(field));
          x.remove(HighlightFlag.WEIGHT_MATCHES);
          return x;
        }
      };

...and the tests started to pass.

Again, i don't really understand this code, but: Knowing that the problem
happens when using TermVectorOffsetStrategy means that usages of WEIGHT_MATCHES
in getOffsetStrategy's ANALYSIS codepath probably aren't relevant -- which
leads me to assume the source of the problem is
probably FieldOffsetStrategy.createOffsetsEnumsWeightMatcher ?

But this brings me back to not really understanding what code is "at
fault" here ? ... The existence of WEIGHT_MATCHES and the design of
FieldOffsetStrategy.createOffsetsEnumsWeightMatcher to return an
OffsetsEnum ordered by the "weighted" matches implies that it's
expected/allowed for the offsets in Passages to be out of (ordinal) order
... so does that mean DefaultPassageFormatter is broken for not
expecting this?
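
If the answer turns out to be "the formatter should tolerate this", one
possible shape for a defensive workaround is to reorder the matches by start
offset before running any overlap logic. The sketch below is only an
illustration of that idea on plain arrays: MatchOrderSketch and sortByStart
are hypothetical names, nothing here is Lucene API, and it is not a claim
about how (or where) this should actually be fixed in Lucene.

```java
import java.util.Arrays;

/** Hypothetical helper: reorders parallel match arrays by start offset. Not part of Lucene. */
class MatchOrderSketch {

  /** Returns {sortedStarts, sortedEnds}; ties on start offset are broken by end offset. */
  static int[][] sortByStart(int[] starts, int[] ends) {
    // Sort an index permutation so the two parallel arrays stay aligned.
    Integer[] idx = new Integer[starts.length];
    for (int i = 0; i < idx.length; i++) {
      idx[i] = i;
    }
    Arrays.sort(idx, (a, b) -> starts[a] != starts[b]
        ? Integer.compare(starts[a], starts[b])
        : Integer.compare(ends[a], ends[b]));
    int[] sortedStarts = new int[idx.length];
    int[] sortedEnds = new int[idx.length];
    for (int i = 0; i < idx.length; i++) {
      sortedStarts[i] = starts[idx[i]];
      sortedEnds[i] = ends[idx[i]];
    }
    return new int[][] {sortedStarts, sortedEnds};
  }

  public static void main(String[] args) {
    // Offsets from the failing Passage: learning[8-16], machine[0-7].
    int[][] sorted = sortByStart(new int[] {8, 0}, new int[] {16, 7});
    System.out.println(Arrays.toString(sorted[0]) + " " + Arrays.toString(sorted[1]));
  }
}
```

Whether sorting belongs in the formatter, in Passage, or in the OffsetsEnum
is exactly the open question above; the sketch only shows that restoring
offset order is cheap once the matches for a Passage are collected.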

-Hoss
http://www.lucidworks.com/
