Hi,
I have some questions about how to use Lucene for the specific purpose of
finding document similarities. Lucene seems to have classes that were made
for this, including ClassicSimilarity and BM25Similarity. However, I’m
fumbling a bit when it comes to implementing them.
From what I understand, to use these classes you simply set the similarity
of your IndexWriter and IndexSearcher, then submit a query. The documents
returned from your query should be ordered from highest to lowest
similarity.
My initial thought was to just use a phrase query to hold the "document" I
want to find similarities to, but phrase queries are limited in that they
only match documents in which the terms occur within a certain slop
distance of each other. Term/Boolean queries seem similarly limited in that
they rank documents only if they contain all the terms in the query.
Ideally, I’d like to submit a query that would essentially be a document
itself. Each of my queries contains about 10 phrases of 5-10 terms each. I
would like to compare this query with all the documents in my index to see
which is the most similar and which is the least similar. I feel as if
there is an easy way to do this that I'm missing; after all, I essentially
just want to remove a step from the process. Any help would be much
appreciated.
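To make the goal concrete, here is a minimal plain-Java sketch of the kind of ranking I have in mind. It does not use Lucene at all: it is a toy TF-IDF cosine similarity, which is an assumption on my part about what "similar" should mean, and the class and method names (SimilaritySketch, termFreqs, weigh, cosine, scores, rank) are made up for illustration. The point is only that every document gets a score against the whole query text, including documents that share just some (or none) of its terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Lucene code: rank documents against a
// query "document" by TF-IDF cosine similarity.
public class SimilaritySketch {

    // Lowercase and split on non-alphanumerics; a crude stand-in for an analyzer.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Turn raw term counts into tf-idf weights using a smoothed idf.
    static Map<String, Double> weigh(Map<String, Integer> tf,
                                     Map<String, Integer> df, int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            double idf = Math.log(1.0 + (double) numDocs / (1 + docFreq));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    // Cosine of the angle between two sparse vectors; 0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Score every document against the query; no document is filtered out.
    static double[] scores(String query, List<String> docs) {
        List<Map<String, Integer>> tfs = new ArrayList<>();
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> tf = termFreqs(doc);
            tfs.add(tf);
            for (String term : tf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> q = weigh(termFreqs(query), df, docs.size());
        double[] result = new double[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            result[i] = cosine(q, weigh(tfs.get(i), df, docs.size()));
        }
        return result;
    }

    // Document indices ordered from most similar to least similar.
    static List<Integer> rank(String query, List<String> docs) {
        final double[] s = scores(query, docs);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < s.length; i++) {
            order.add(i);
        }
        order.sort((i, j) -> Double.compare(s[j], s[i]));
        return order;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "the quick brown fox jumps over the lazy dog",
            "a fast brown fox leaps over a sleepy dog",
            "stock markets fell sharply on monday");
        System.out.println(rank("quick brown fox", docs));
    }
}
```

What I am hoping for is the Lucene equivalent of rank(): hand it the full text of my query and get back every indexed document in order, without having to hand-build term or phrase clauses first.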
Thank you,
-John B