Mailing List Archive

search similar docs?
Hi,

I was thinking of implementing a search for similar documents (like some commercial search engines do) and wondering if anyone has
already done something like that with Lucene. I thought of collecting all terms of the selected document (or maybe some subset of
them) and then creating a MultiTermQuery containing those terms. Does it make sense? Is there a better way to achieve this?
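
A minimal sketch of the idea in the paragraph above, in plain Java with no Lucene calls: collect the distinct terms of a document, keep a subset, and join them into one disjunctive query. The class and method names (SimilarQuerySketch, distinctTerms, buildOrQuery) are hypothetical, the tokenizer is a crude regex split, and the "a OR b" string is only illustrative of a query built from many terms.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class SimilarQuerySketch {

    // Split text into lowercase word tokens, dropping duplicates
    // while keeping first-seen order.
    static List<String> distinctTerms(String text) {
        Set<String> seen = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                seen.add(token);
            }
        }
        return new ArrayList<>(seen);
    }

    // Join at most maxTerms terms into an "a OR b OR c" query string.
    static String buildOrQuery(List<String> terms, int maxTerms) {
        List<String> subset = terms.subList(0, Math.min(maxTerms, terms.size()));
        return String.join(" OR ", subset);
    }

    public static void main(String[] args) {
        List<String> terms = distinctTerms("Lucene indexes text; Lucene searches text fast");
        System.out.println(buildOrQuery(terms, 3));
    }
}
```

Capping the number of terms keeps the query small; which subset to keep is the open question in this thread.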

In order to do it, I would have to get all terms of a given document, and so far I haven't found an easy way of doing it (I hope
there's one ;-). The way I was thinking is to extend FilteredTermEnum but, instead of selecting terms by similarity, select them by
docid (for each term, get its termdocs and check for the desired docid). It doesn't look very efficient, so if someone could
contribute other ideas or even related experiences I'd appreciate it very much.
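
To illustrate why the scan described above looks inefficient, here is a sketch that models the index as an in-memory map from term to the set of docids containing it. This is an assumption made for illustration only: it does not use FilteredTermEnum or any other Lucene class, and TermsByDocSketch and termsOfDocument are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a term -> docids index.
final class TermsByDocSketch {

    // Walk every term in the index and keep those whose posting set
    // contains docId. The work grows with the total number of terms
    // in the index, not with the size of the one document.
    static List<String> termsOfDocument(Map<String, Set<Integer>> postings, int docId) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> entry : postings.entrySet()) {
            if (entry.getValue().contains(docId)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = new TreeMap<>();
        postings.put("index", Set.of(0, 1));
        postings.put("lucene", Set.of(0, 2));
        postings.put("search", Set.of(1));
        System.out.println(termsOfDocument(postings, 0));
    }
}
```

Every lookup touches the whole term dictionary, which is the cost the message is worried about.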

TIA

Best regards,

--Daniel


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: search similar docs?
On Tue, Feb 12, 2002 at 05:24:45PM -0300, Daniel Calvo wrote:
> Hi,
>
> I was thinking of implementing a search for similar documents (like some commercial search engines do) and wondering if anyone has
> already done something like that with Lucene. I thought of collecting all terms of the selected document (or maybe some subset of
> them) and then creating a MultiTermQuery containing those terms. Does it make sense? Is there a better way to achieve this?

I'd think it would be hard to gather a list of terms from the
current hit that are meaningful to the user. It would seem that
an alias expansion of the original search expression might work
better, or possibly a collection of the most common terms in the
document we're looking for matches to, after going through a
stop word analyzer or something.
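
The "most common terms after a stop word analyzer" suggestion could be sketched roughly as follows. This is plain Java, not a Lucene analyzer: the stop-word list is a tiny made-up sample, and FrequentTermsSketch and topTerms are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class FrequentTermsSketch {

    // Tiny illustrative stop-word list; a real one would be much longer.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "in", "is");

    // Return up to 'limit' non-stop-word terms, most frequent first.
    static List<String> topTerms(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically so the
        // result is deterministic.
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(limit, entries.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topTerms("The index of the index is the search index and the search", 2));
    }
}
```

Raw frequency within one document is only one possible ranking; terms that are common across the whole collection would still dominate unless filtered further.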

I've not implemented anything like this. Just a few thoughts.

Andy

--
--------------------------------------------------
Andrew Libby
CommNav, Inc
alibby@commnav.com


RE: search similar docs?
Can't you feed the text of the orig/matching doc to the search engine
as a query and see what docs it returns? Then a "similar" doc is one
that has words in common with the orig doc.
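
As a rough illustration of "similar means words in common", the sketch below ranks candidate texts by how many distinct words they share with the source text. This is a crude stand-in written in plain Java, not how a search engine actually scores a query; OverlapRankSketch, sharedWords and rank are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class OverlapRankSketch {

    // Distinct lowercase word tokens of a text.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Number of distinct words the two texts have in common.
    static int sharedWords(String a, String b) {
        Set<String> common = words(a);
        common.retainAll(words(b));
        return common.size();
    }

    // Indexes of the candidates, ordered by shared-word count, highest
    // first. The sort is stable, so ties keep their original order.
    static List<Integer> rank(String source, List<String> candidates) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingInt(
                (Integer i) -> sharedWords(source, candidates.get(i))).reversed());
        return order;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
                "coffee machine broken",
                "driver crashes when printer starts",
                "printer out of paper");
        System.out.println(rank("printer driver crashes on startup", candidates));
    }
}
```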

I've done this kind of monster query with some of our internal systems:
we have support mail that comes in, and all kinds of intranet sites, bug
databases, and javadoc indexed. When reading the support mail in a web
interface you can feed the *entire* body of the mail to the search engine
to find "similar" support mails, bug reports, howto docs etc. Not fast,
and just a proof of concept right now, but kinda interesting. [Note: had
to switch the <form> method from GET to POST due to the size of the
"query".]

-----Original Message-----
From: Daniel Calvo [mailto:dcalvo@ig.com.br]
Sent: Tuesday, February 12, 2002 12:25 PM
To: Lucene Users List
Subject: search similar docs?

