ComplexPhraseQueryParser throws an IllegalArgumentException for certain fuzzy phrase queries.
I am using Lucene 8.2, and have also reproduced this on 8.9 and 8.10.1.
My query string is either "by~1 word~1" or "ky~1 word~1".
I am looking for a phrase of these two words, allowing a one-character misspelling (fuzziness) in each word.
I realize that 'by' is usually a stop word, which is why I also tested with 'ky'.
My simplified test content is one of "AC-2.b word", "AC-2.k word", or "AC-2.y word".
The first part of the test content is pulled from actual data my customers are trying to search.
For the query with 'by~1', the exception occurs if the content has '.b' or '.y', but not '.k'.
For the query with 'ky~1', the exception occurs if the content has '.k' or '.y', but not '.b'.
Here is the test code:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
public class phraseTest {
    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(
            analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
    public static Query queryToSearch = null;
    public static IndexReader idxReader;
    public static IndexSearcher idxSearcher;
    public static TopDocs hits;
    public static String query_field = "Content";
    // Pick only one content string
    // public static String content = "AC-2.b word";
    public static String content = "AC-2.k word";
    // public static String content = "AC-2.y word";
    // Pick only one query string
    // public static String queryString = "\"by~1 word~1\"";
    public static String queryString = "\"ky~1 word~1\"";
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {
        System.out.println("Content is\n " + content);
        System.out.println("Query field is " + query_field);
        System.out.println("Query String is '" + queryString + "'");
        Document doc = new Document(); // create a new document
        /**
         * Create a field with term vector enabled
         */
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);
        //term vector enabled
        Field cField = new Field(query_field, content, type);
        doc.add(cField);
        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();
            idxReader = DirectoryReader.open(ramDirectory);
            idxSearcher = new IndexSearcher(idxReader);
            ComplexPhraseQueryParser qp =
                    new ComplexPhraseQueryParser(query_field, analyzer);
            queryToSearch = qp.parse(queryString);
            // Here is where the searching, etc starts
            hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
            System.out.println("scoreDoc size: " + hits.scoreDocs.length);
            // highlight the hits ...
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
Here is the exception (using Lucene 8.2):
Exception in thread "main" java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.ConstantScoreQuery" found in phrase query string "ky~1 word~1"
at org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:325)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:666)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:439)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:564)
at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:416)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:427)
at phraseTest.main(phraseTest.java:79)
Am I using ComplexPhraseQueryParser wrong?
Is this a bug in Lucene?
I have also tested this with a query string like "dog~2 word~1".
This causes the same exception if the content has '.d', '.o', or '.g'.
It looks like a fuzzy term that can reduce to a single character runs into trouble when the content contains a matching single-character term.
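To illustrate the pattern, here is a minimal sketch in plain Java (no Lucene involved) that computes the Levenshtein distance between each query term and the single-character terms I assume the analyzer produces from '.b', '.k', and so on. The class and method names are made up for illustration, and this is not Lucene's own fuzzy matching code. It only shows that the failing combinations are the ones where the single-character term lies within the allowed edit distance of the query term; it says nothing about why the parser then fails.

```java
// Illustration only: a textbook Levenshtein distance, not Lucene's implementation.
// Assumption: content such as "AC-2.b" is indexed with "b" as its own term.
class FuzzyDistanceCheck {

    // Dynamic-programming edit distance (insert, delete, substitute).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting a's prefix entirely
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if indexedTerm is within maxEdits of queryTerm (as in queryTerm~maxEdits).
    static boolean withinFuzzyRange(String queryTerm, int maxEdits, String indexedTerm) {
        return editDistance(queryTerm, indexedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        String[] queryTerms = {"by", "ky", "dog"};
        int[] maxEdits = {1, 1, 2};
        String[] singleCharTerms = {"b", "k", "y", "d", "o", "g"};
        for (int q = 0; q < queryTerms.length; q++) {
            for (String term : singleCharTerms) {
                System.out.println(queryTerms[q] + "~" + maxEdits[q] + " vs '" + term
                        + "': distance " + editDistance(queryTerms[q], term)
                        + ", within range: "
                        + withinFuzzyRange(queryTerms[q], maxEdits[q], term));
            }
        }
    }
}
```

With these distances, 'b' and 'y' are within one edit of 'by' while 'k' is not, and 'd', 'o', and 'g' are all within two edits of 'dog', which matches the cases where I see the exception.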
Thanks in advance for any suggestions or guidance,
David Shifflett