Probably found bug in GraphTokenStreamFiniteStrings
Hi everyone,

I ran into some exceptions in my production service, which is based on
Lucene. After some investigation I found the problem and built a minimal
example as a test for GraphTokenStreamFiniteStrings (you can add it
to TestGraphTokenStreamFiniteStrings):

========================================

public void testX() throws IOException {
  CannedTokenStream cts =
      new CannedTokenStream(
          token("???????2", 1, 2),
          token("???????", 2, 1),
          token("???????", 0, 1),
          token("2", 1, 1));
  GraphTokenStreamFiniteStrings graph = new GraphTokenStreamFiniteStrings(cts);

  assertTrue(graph.getTerms("", 0).length > 0);
}

========================================

Currently this test fails on an assertion at
org.apache.lucene.util.automaton.Automaton.initTransition(Automaton.java:484)
with the message "state=0 nextState=0".

I ran it on the latest master revision: eba0e255352adb2cb72031699c3f8d3963286d89

In production we use 8.4.1.

I also suspect the problem is somewhere in the Operations#removeDeadStates
code, because if I simply remove this operation from the
GraphTokenStreamFiniteStrings constructor, the test passes:

========================================

public GraphTokenStreamFiniteStrings(TokenStream in) throws IOException {
  Automaton aut = build(in);
  this.det = aut;
  // Operations.removeDeadStates(
  //     Operations.determinize(aut, DEFAULT_MAX_DETERMINIZED_STATES));
}

========================================


I have to say I'm not entirely sure whether this is a bug in
GraphTokenStreamFiniteStrings or in the analyzer that produces such a
TokenStream.


Originally a user sent the misspelled phrase "???????2", with a space
missing before the '2'. Our analyzer then did some morphological
processing, which is how "???????" and "???????" were produced.

Here is more information about the TokenStream:

========================================

term: ???????2, type: <ALPHANUM>, startOffset:0, endOffset:8, posInc:1, posLength:2
term: ???????, type: SYNONYM, startOffset:0, endOffset:7, posInc:2, posLength:1
term: ???????, type: <ALPHANUM>, startOffset:0, endOffset:7, posInc:0, posLength:1
term: 2, type: <ALPHANUM>, startOffset:7, endOffset:8, posInc:1, posLength:1

========================================
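
To make the positions in the listing above easier to check, here is a small
standalone sketch that turns each (posInc, posLength) pair into a start and end
position. It assumes the usual convention that a token's position is the
running sum of position increments (starting from -1) and that its arc ends at
position + posLength. The class TokenArcs and its arcs method are hypothetical
helpers, not part of Lucene, and this is not necessarily how
GraphTokenStreamFiniteStrings handles gaps internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Lucene. */
class TokenArcs {

  /** Each input row is {posInc, posLength}; each output row is {startPos, endPos}. */
  static List<int[]> arcs(int[][] tokens) {
    List<int[]> result = new ArrayList<>();
    int pos = -1; // position before the first token
    for (int[] t : tokens) {
      pos += t[0]; // posInc advances the current position
      result.add(new int[] {pos, pos + t[1]}); // posLength sets where the arc ends
    }
    return result;
  }

  public static void main(String[] args) {
    int[][] tokens = {
      {1, 2}, // "???????2"
      {2, 1}, // "???????" (SYNONYM)
      {0, 1}, // "???????"
      {1, 1}, // "2"
    };
    for (int[] arc : arcs(tokens)) {
      System.out.println(arc[0] + " -> " + arc[1]);
    }
  }
}
```

Under that assumption the arcs are 0 -> 2, 2 -> 3, 2 -> 3 and 3 -> 4: no token
starts at position 1, and the last three tokens come after the two-position
span of the first token instead of running alongside it.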


If you need some extra information just let me know.

I'm also ready to file an issue in JIRA if this is a bug in
GraphTokenStreamFiniteStrings.

So what do you think about it?


- Alexander Menshikov