
Two issues with synonym phrase matching
I am indexing some technical documentation and have been trying to add
synonym matching to the searches. Specifically, I am adding the synonyms at
index time so that any synonym matches at search time.



a. Simple synonyms (wordA = wordB) are working just fine.



b. Multiple synonyms (wordA = wordB = wordC) are almost working, but
not quite. The problem appears to be that wordA and wordB are allocated the
same position in the index, but wordC is shifted by 1 (and a wordD is
shifted by 2, etc.). Thus phrase searches that include one of the shifted
synonyms only work with a proximity modifier.



c. Synonym phrases (wordA = wordB wordC) are not working properly.



I have prepared a simple test case which can be downloaded here:
https://www.dropbox.com/s/rn4np7ja4wcpodl/mydemo.zip?dl=0



(The download is ~5 MB because it includes the three required Lucene JAR
files, which are from Lucene 9.2.0.)



Unzip the download into a directory called "mydemo", then compile and run it.
The example assumes you have Ant; if you don't, it is simple enough that you
should be able to reproduce the steps after reading the build.xml file.



As well as the three library files, the zipfile provides four Java files,
three trivial documents, a synonym list which is loaded by the indexing
step, and two query lists for the search step.



The three documents are almost identical; they contain a sentence which
variously contains "release note" (or "release notes" or "release notice"),
and "document subtree" (or "sub tree" or "sub-tree").



The synonym list contains two sets of synonyms:

note,notes,notice,notification

subtree,sub tree,sub-tree
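
To make the difference between the two sets concrete, here is a minimal plain-Java sketch (no Lucene involved) that parses the two lines above and counts the words in each entry. The class and method names are made up for illustration, and splitting on whitespace only is an assumption: many tokenizers would also split "sub-tree" at the hyphen, which would make it a two-token entry as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of mydemo.zip: it only inspects the synonym list.
public class SynonymListSketch {

    /** Splits one comma-separated synonym line into trimmed, non-empty entries. */
    static List<String> parseLine(String line) {
        List<String> entries = new ArrayList<>();
        for (String raw : line.split(",")) {
            String entry = raw.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Counts whitespace-separated words (assumed tokenization, see lead-in). */
    static int wordCount(String entry) {
        return entry.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "note,notes,notice,notification",
                "subtree,sub tree,sub-tree");
        for (String line : lines) {
            for (String entry : parseLine(line)) {
                int n = wordCount(entry);
                // Entries with more than one word span more than one index position.
                String kind = (n > 1) ? "multi-word" : "single word";
                System.out.println(entry + ": " + kind);
            }
        }
    }
}
```

Under this assumption only "sub tree" is reported as multi-word. An entry like that cannot be stored at a single index position the way the single-word entries can.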



The query lists each contain about 10 queries. Every single query should
match all three documents, but some of them do not.



a. The query list query.rn.in shows that, where a term has multiple
synonyms, their positions are shifted. Thus "release note" and "release
notes" match all three documents, but "release notice" only matches if the
search term is "release notice"~1 (because notice is the second synonym and
has been shifted one position), and "release notification" only matches if
the term is "release notification"~2 (because notification is the third
synonym and has been shifted two positions). The command "ant rnsearch"
should run these searches and show the results.



b. The query list query.st.in shows that the phrase "sub tree" is not
being correctly identified as a synonym of the other two terms. The command
"ant stsearch" should run these searches.

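The slop values in item a can be reproduced with a small plain-Java model of term positions. It is not Lucene code and not taken from MyAnalyzer.java; the class and method names are hypothetical. It assumes one document containing "release note", with the injected synonyms placed as described above. In Lucene a token's position comes from its position increment, and an increment of 0 puts a token at the same position as the previous one, so the "stacked" layout below is what would result if every injected synonym had an increment of 0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of index positions; it does not call any Lucene API.
public class PhraseSlopSketch {

    /**
     * Minimum slop for the in-order two-term phrase "first second".
     * Assumes 'second' is indexed after 'first'.
     */
    static int requiredSlop(Map<String, Integer> positions, String first, String second) {
        int gap = positions.get(second) - positions.get(first);
        if (gap < 1) {
            throw new IllegalArgumentException("second term must follow the first");
        }
        // Adjacent terms (gap of 1) need no slop; each extra position costs one.
        return gap - 1;
    }

    /** Positions as observed in the post: later synonyms drift to the right. */
    static Map<String, Integer> shiftedIndex() {
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("release", 0);
        p.put("note", 1);
        p.put("notes", 1);          // first synonym shares the original position
        p.put("notice", 2);         // second synonym shifted by one
        p.put("notification", 3);   // third synonym shifted by two
        return p;
    }

    /** Positions if every synonym were stacked on the original term. */
    static Map<String, Integer> stackedIndex() {
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("release", 0);
        for (String term : new String[] {"note", "notes", "notice", "notification"}) {
            p.put(term, 1);
        }
        return p;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"note", "notes", "notice", "notification"}) {
            System.out.println("release " + term
                    + ": shifted slop=" + requiredSlop(shiftedIndex(), "release", term)
                    + ", stacked slop=" + requiredSlop(stackedIndex(), "release", term));
        }
    }
}
```

With the shifted layout the model gives slop 0, 0, 1 and 2 for note, notes, notice and notification, which matches the ~1 and ~2 modifiers above. With the stacked layout every phrase needs slop 0.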


If anyone can point to what I am doing wrong in MyAnalyzer.java I would be
extremely grateful.



Cheers

T