Mailing List Archive

I am getting an exception in ComplexPhraseQueryParser when fuzzy searching
I am using Lucene 8.2, but have also verified this on 8.9 and 8.10.1.
My query string is either "by~1 word~1" or "ky~1 word~1" (the double quotes are part of the query string, since it is a phrase query).
I am looking for a phrase made up of these two words, allowing for a one-character misspelling (fuzziness) in each word.
I realize that 'by' is usually a stop word, which is why I also tested with 'ky'.
My simplified test content is one of "AC-2.b word", "AC-2.k word", or "AC-2.y word".
The first part of the test content is pulled from actual data my customers are trying to search.
For the query with 'by~1', the exception occurs if the content has '.b' or '.y', but not '.k'.
For the query with 'ky~1', the exception occurs if the content has '.k' or '.y', but not '.b'.
Here is the test code:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class phraseTest {

    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(
            analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
    public static Query queryToSearch = null;
    public static IndexReader idxReader;
    public static IndexSearcher idxSearcher;
    public static TopDocs hits;
    public static String query_field = "Content";

    // Pick only one content string
    // public static String content = "AC-2.b word";
    public static String content = "AC-2.k word";
    // public static String content = "AC-2.y word";

    // Pick only one query string
    // public static String queryString = "\"by~1 word~1\"";
    public static String queryString = "\"ky~1 word~1\"";

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {

        System.out.println("Content is\n " + content);
        System.out.println("Query field is " + query_field);
        System.out.println("Query String is '" + queryString + "'");

        Document doc = new Document(); // create a new document

        /**
         * Create a field with term vector enabled
         */
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);

        //term vector enabled
        Field cField = new Field(query_field, content, type);
        doc.add(cField);

        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();

            idxReader = DirectoryReader.open(ramDirectory);
            idxSearcher = new IndexSearcher(idxReader);
            ComplexPhraseQueryParser qp =
                    new ComplexPhraseQueryParser(query_field, analyzer);
            queryToSearch = qp.parse(queryString);

            // Here is where the searching, etc starts
            hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
            System.out.println("scoreDoc size: " + hits.scoreDocs.length);

            // highlight the hits ...

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }
}

Here is the exception (using Lucene 8.2):

Exception in thread "main" java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.ConstantScoreQuery" found in phrase query string "ky~1 word~1"
at org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:325)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:666)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:439)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:564)
at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:416)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:427)
at phraseTest.main(phraseTest.java:79)

Am I using ComplexPhraseQueryParser wrong?
Is this a bug in Lucene?

I have also tested this with a query string like "dog~2 word~1" (again with the double quotes as part of the query).
This causes the same exception if the content has ‘.d’, ‘.o’, or ‘.g’.

It looks like a fuzzy term that can reduce to a single character runs into trouble when it encounters a matching single-character term in the content.
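
To make that pattern concrete, here is a minimal sketch that computes the plain Levenshtein distance between each query term and the single letters from the test content. It is not Lucene code: the class name FuzzyDistanceSketch and the levenshtein helper are made up for illustration, and it assumes the analyzer indexes the trailing letter (b, k, or y) as its own term, which has not been confirmed here. Lucene's own fuzzy matching may differ in details, for example in how it treats transpositions.

```java
// Sketch only: plain Levenshtein distance as a stand-in for the edit-distance
// check behind a fuzzy term such as ky~1. Hypothetical helper, not Lucene code.
class FuzzyDistanceSketch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions, and substitutions each cost 1).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            // Swap the two rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String[] queryTerms = {"by", "ky"};
        // Assumption: these single letters are separate terms in the index.
        String[] indexedTerms = {"b", "k", "y"};
        for (String q : queryTerms) {
            for (String t : indexedTerms) {
                int d = levenshtein(q, t);
                System.out.println(q + "~1 vs " + t + ": distance " + d
                        + (d <= 1 ? " (within 1 edit)" : " (more than 1 edit)"));
            }
        }
    }
}
```

Under these assumptions, the single letters within one edit of each query term are exactly the ones for which the exception was reported: 'b' and 'y' for 'by~1', and 'k' and 'y' for 'ky~1'. Likewise, 'd', 'o', and 'g' are each two edits away from 'dog'.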

Thanks in advance for any suggestions or guidance,

David Shifflett