Mailing List Archive

Term Vectors -- searching or just ranking?
Hi,

We are implementing term vectors, and there is something about which I am
unclear: Can term vectors be used to perform a search in its entirety
(e.g., rank all 1 million documents in a database in order, and then return
the top 100), or, due to computational time requirements, are term vectors
only intended to be a ranking method for a small subset of data that is the
result of a Boolean search (e.g., we know the 100 documents that are
possible answers, now put them in relevancy order)?

Thanks,
James

Re: Term Vectors -- searching or just ranking?
Hi James,

I can't speak for anyone else, but my experience is that the general
approach is to first select a subset based on the angle between the query
vector and the document vector, in their non-reduced forms (in vector
notation, this is just a normal keyword search, which is what Lucene does
by default). From there, you pick up the documents in that subset along
with their reduced term vectors and rank them by the angle between each
one and the reduced query vector.
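A minimal sketch of that two-step approach, in plain Python rather than Lucene. The function names, the dict-based vector layout, and the toy weights are all assumptions made for illustration; in particular, the reduced vectors are hand-made numbers, not the output of a real decomposition.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as {key: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def two_stage_search(query_full, query_reduced, docs_full, docs_reduced, top_n=100):
    # Step 1: keyword match on the full (non-reduced) vectors.
    # A document survives only if it shares at least one weighted term with the query.
    candidates = [doc_id for doc_id, vec in docs_full.items()
                  if cosine(query_full, vec) > 0.0]
    # Step 2: rank only the survivors by their angle to the reduced query vector.
    ranked = sorted(candidates,
                    key=lambda doc_id: cosine(query_reduced, docs_reduced[doc_id]),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical toy data: term weights and 2-dimensional reduced vectors.
docs_full = {
    "d1": {"lucene": 1.0, "search": 1.0},
    "d2": {"search": 1.0, "engine": 2.0},
    "d3": {"cooking": 1.0, "recipes": 1.0},
}
docs_reduced = {
    "d1": {0: 0.9, 1: 0.1},
    "d2": {0: 0.6, 1: 0.4},
    "d3": {0: 0.0, 1: 1.0},
}
query_full = {"search": 1.0}
query_reduced = {0: 1.0, 1: 0.0}

result = two_stage_search(query_full, query_reduced, docs_full, docs_reduced)
print(result)
```

Here "d3" never reaches the second step because it shares no term with the query, so the reduced-vector comparison only runs on the two candidates.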
If you skip the first step, you need one dot product (query vector and
document vector) for every document in your database, but you only need to
store the reduced term vectors. That's a lot of computation, but it's
necessary if you want to match documents that are related to a query but
do not contain some or any of the words in it. In my experience, this is a
cool feature, but the hits it returns are usually pretty shitty. If a
document doesn't get a hit on a normal keyword search, just leave it out
(note, this is only my opinion).
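To make the cost of skipping the first step concrete, here is a rough sketch that scores every document against the reduced query vector and keeps the best few with a heap. The names `cosine_dense` and `rank_all`, the list-based vectors, and the toy numbers are assumptions for illustration only.

```python
import heapq
import math

def cosine_dense(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_all(query_reduced, docs_reduced, top_n=100):
    # One dot product per document in the collection: the work grows with
    # (number of documents) x (number of reduced dimensions).
    scored = ((cosine_dense(query_reduced, vec), doc_id)
              for doc_id, vec in docs_reduced.items())
    # nlargest returns the top_n highest-scoring pairs without sorting everything.
    return [doc_id for _score, doc_id in heapq.nlargest(top_n, scored)]

# Hypothetical reduced vectors. Assume "d4" shares no keyword with the query,
# so a keyword search would never have returned it.
all_docs_reduced = {
    "d1": [0.9, 0.1],
    "d2": [0.6, 0.4],
    "d3": [0.0, 1.0],
    "d4": [0.8, 0.2],
}
top_two = rank_all([1.0, 0.0], all_docs_reduced, top_n=2)
print(top_two)
```

With these made-up numbers "d4" ends up in the top two even though, by assumption, it contains none of the query words. That is the feature described above, and also the reason the hits can be hard to trust.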
Some terminology in case you did not follow: "reduced" refers to the
projection of a vector onto a smaller subspace (you can normally reduce the
dimension / column space of the term-document matrix by ~60% with virtually
no loss of precision in your searches). See "singular value decomposition"
for more on that.
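For reference, one common way to write this reduction is the truncated SVD used in latent semantic indexing. The notation is an assumption, not something from this thread: A is the term-document matrix with terms as rows and documents as columns, d_j is column j of A, q is the query in the same term space, and k is the number of singular values kept.

```latex
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad
\hat{d}_j = \Sigma_k^{-1} U_k^{T} d_j, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{T} q, \qquad
\cos\theta_j = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Here the hatted vectors are the "reduced" vectors, and documents are ranked by cos(theta_j). Some formulations scale the projection differently (for example, omitting the inverse of Sigma_k), so treat this as one convention among several.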

Hope that helps,
Fredrik

On 4/20/06, James <james@ryley.com> wrote:
>
> Hi,
>
> We are implementing term vectors, and there is something about which I am
> unclear: Can term vectors be used to perform a search in its entirety
> (e.g., rank all 1 million documents in a database in order, and then
> return the top 100), or, due to computational time requirements, are term
> vectors only intended to be a ranking method for a small subset of data
> that is the result of a Boolean search (e.g., we know the 100 documents
> that are possible answers, now put them in relevancy order)?
>
> Thanks,
> James