Mailing List Archive

Need some guidance for multi-term synonym matching
I am using Lucene 8.6.3 in an application which searches a library of
technical documentation. I have implemented synonym matching which works for
single-word replacements, but does not match when one of the synonyms has
two or more words. My attempts to support multi-term synonyms are failing.
I'm sure one of the reasons is that I don't really know what I'm doing, but
it hasn't helped that this seems to be an area where the Lucene
implementation has changed regularly, so a lot of the examples on the web
are out of date.



I have a custom TechTokenFilter which splits input on whitespace into words,
numbers, and individual characters like + and :.
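
To make the splitting rule concrete, here is a minimal plain-Java sketch of the kind of splitting described above. It is not the real TechTokenFilter (which is a Lucene TokenFilter and is not shown in this post); the class name TechSplitSketch, the PIECE pattern and the exact rule (runs of letters, runs of digits, every other non-space character on its own) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the splitting done by TechTokenFilter.
class TechSplitSketch {
    // Assumed rule: a run of letters, a run of digits,
    // or any other single non-whitespace character.
    private static final Pattern PIECE = Pattern.compile("\\p{L}+|\\p{N}+|\\S");

    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = PIECE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [tcp, :, 8080, c, +, +]
        System.out.println(split("tcp:8080 c++"));
    }
}
```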

I have a custom analyzer which chains a WhitespaceTokenizer, a
LowerCaseFilter, my TechTokenFilter, a SynonymGraphFilter and a
FlattenGraphFilter.

The full analyzer is used when indexing the documents, while the final two
filters are dropped when querying the index.



The SynonymGraphFilter loads synonyms from a text file which contains
multiple lines of the form: word,synonym[,synonym ...]

For example

analyze,analyse

chart,graph,plot

enquire,inquire
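
As an illustration of this file format, the following plain-Java sketch parses one line into a head word and its synonyms, independently of Lucene. The names SynonymLineSketch, Entry and parse are hypothetical and exist only for this example. Each synonym is split on whitespace, so a synonym made of several words would come out as a list of several words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing how one line of the synonym file breaks down.
class SynonymLineSketch {
    // One head word plus its synonyms; each synonym is kept as a list of words.
    static final class Entry {
        final String word;
        final List<List<String>> synonyms;

        Entry(String word, List<List<String>> synonyms) {
            this.word = word;
            this.synonyms = synonyms;
        }
    }

    static Entry parse(String line) {
        String[] terms = line.split(",");
        String word = terms[0].trim();
        List<List<String>> synonyms = new ArrayList<>();
        for (int i = 1; i < terms.length; i++) {
            // Splitting on whitespace keeps the words of a multi-word synonym separate.
            synonyms.add(Arrays.asList(terms[i].trim().split("\\s+")));
        }
        return new Entry(word, synonyms);
    }

    public static void main(String[] args) {
        Entry e = parse("chart,graph,plot");
        // prints chart -> [[graph], [plot]]
        System.out.println(e.word + " -> " + e.synonyms);
    }
}
```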



While I'm sure this code could be improved, the gist of it is as follows
(I've left out try/catch, validation code and the like to save space):



/* ******************* CODE *********************** */

/* Analyzer and Synonym Map */



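import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.FlattenGraphFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;

public class TechAnalyzer extends Analyzer {

    // opts is a class which says if we are indexing or querying,
    // and where the synonym list is
    public TechAnalyzer(Options opts) {
        this.options = opts;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldname) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
        if (options.indexing) {
            // Synonyms are applied at index time only (third argument: ignoreCase)
            result = new SynonymGraphFilter(result,
                    getSynonyms(options.synonymList), true);
            // Flatten the token graph so it can be written to the index
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }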



    // Builds the synonym map from the file, one rule set per line
    // (exception handling omitted)
    private static SynonymMap getSynonyms(String synlist) {
        boolean dedup = true;
        SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
        BufferedReader br = new BufferedReader(new FileReader(synlist));
        String line;
        while ((line = br.readLine()) != null) {
            processLine(builder, line);
        }
        br.close();
        return builder.build();
    }



    // Each line is: word,synonym[,synonym ...]
    private static void processLine(SynonymMap.Builder builder, String line) {
        boolean includeOrig = true;
        String[] terms = line.split(",");
        String word = terms[0];
        String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
        for (String synonym : synonymsOfWord) {
            addSyn(builder, word, synonym, includeOrig);
        }
    }



    private static void addSyn(SynonymMap.Builder builder, String word,
            String synonym, boolean includeOrig) {
        // Split a multi-word synonym on whitespace and join its words
        // into a single CharsRef for the builder
        CharsRef syn = SynonymMap.Builder.join(synonym.split("\\s+"),
                new CharsRefBuilder());
        builder.add(new CharsRef(word), syn, includeOrig);
    }

    private Options options;
}



/* Building the index, opts.indexing = true */



Analyzer analyzer = new TechAnalyzer(opts);
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
etc.



/* Searching the index, opts.indexing = false */



IndexReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new TechAnalyzer(opts);
QueryParser parser = new QueryParser("text", analyzer);
parser.setDefaultOperator(QueryParserBase.AND_OPERATOR);
String line = inp.readLine();
Query query = parser.parse(line);
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
etc.

/* ******************* END CODE *********************** */



While this works perfectly with simple word synonyms (such as those given as
examples above), it fails to match documents when the synonym list includes
phrases rather than words, although the code to construct the synonym map
appears (to my eyes) to be creating it correctly.

Thus if I add something like



infantry,footsoldier,foot soldier



then a search for "infantry" will match "footsoldier" but not "foot
soldier", and a search for "footsoldier" will match "infantry" but not "foot
soldier", and a search for "foot soldier" will match "foot soldier" but not
"footsoldier" or "infantry".



I expect I have to do something more sophisticated than just using a
QueryParser on a list of one or more words, but what and how?



Cheers

T