Mailing List Archive

VSM in Lucene, again
Hi folks.

I read in last month's digest of this list, in a post by
Rajesh Munavalli, that Lucene uses a VSM retrieval method. In my previous
work with VSM, retrieval meant matching a query vector against the document
vectors in the term-document space. I have dissected and customized a lot of
the Lucene indexing and searching classes, but I have yet to discover
where the actual dot product of the query vector and the document vectors is
performed, if Lucene uses this method for information retrieval. This
method also involves an angle threshold below which you consider a document
"close", which is a parameter that Lucene would benefit from exposing in its
API. I have not seen any trace of this, either. To keep a long story short,
a lot of the stuff that I usually associate with VSM and LSI information
retrieval is missing or cleverly hidden.

If someone could shed some light on this issue, I would be very thankful.
It's probably just that we have different notions of the VSM model, but I'd
like to get this straightened out.

Greetings,
Fredrik
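
For readers following the thread, the kind of matching described above can be sketched as below. This is only an illustration of textbook VSM retrieval, not Lucene code: the function names `cosine` and `search`, the `max_angle_deg` parameter, and the toy term weights are all assumptions made up for the example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query, docs, max_angle_deg=60.0):
    """Return (doc_id, cosine) pairs whose angle to the query is at most
    max_angle_deg (a hypothetical "closeness" parameter), best match first."""
    min_cos = math.cos(math.radians(max_angle_deg))
    hits = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    hits = [(doc_id, c) for doc_id, c in hits if c >= min_cos]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Toy term-document space with made-up weights.
docs = {
    "d1": {"lucene": 2.0, "search": 1.0},
    "d2": {"lucene": 1.0, "index": 3.0},
    "d3": {"cooking": 1.0},
}
query = {"lucene": 1.0, "search": 1.0}

for doc_id, c in search(query, docs):
    print(doc_id, round(c, 4))
```

With the default threshold only "d1" is returned; widening `max_angle_deg` admits "d2" as well, which is the tunable "angle" the post asks about.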
Re: VSM in Lucene, again
Hi Fredrik,

Are you looking for org.apache.lucene.search.DefaultSimilarity ?

Otis

Re: VSM in Lucene, again
Hi Otis,

Yes, I have looked through that class thoroughly, but all I see is an
IDF-map lookup with boost functionality. The only thing that lets a query
return a document that does not contain the terms in the query is the
sloppyFreq function. It's more of a semantic trick based on edit distance,
so it has nothing to do with the vector angles in a regular vector space
model. The document terms still have to be semantically similar to the ones
in the query, which is not the case when matching by vector angles in a VSM
(though you often boost documents containing words from the query,
naturally).

Fredrik
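
As a rough aid to the discussion, here is a sketch of how a TF-IDF score of the kind DefaultSimilarity produces can be read as a sparse dot product accumulated term by term. The per-factor formulas are written from memory of the DefaultSimilarity source of that period and should be treated as assumptions, not as the library's API. The `score` function, its arguments, and the sample data are hypothetical, and term and field boosts are left out (taken as 1).

```python
import math

# Assumed per-factor formulas, recalled from DefaultSimilarity; verify against the source.
def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def length_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)

def sloppy_freq(distance):
    # Assumed: down-weights a sloppy phrase match by its positional distance.
    return 1.0 / (distance + 1)

def score(query_terms, doc_tokens, doc_freqs, num_docs):
    """Hypothetical scorer: sum the contributions of the query terms that occur
    in the document. Terms absent from the document contribute nothing, so the
    sum is a dot product over the shared terms only."""
    matched = [t for t in query_terms if t in doc_tokens]
    if not matched:
        return 0.0
    weights = {t: idf(doc_freqs.get(t, 0), num_docs) for t in query_terms}
    query_norm = 1.0 / math.sqrt(sum(w * w for w in weights.values()))
    coord = len(matched) / len(query_terms)
    norm = length_norm(len(doc_tokens))
    total = sum(tf(doc_tokens.count(t)) * weights[t] ** 2 * norm for t in matched)
    return coord * query_norm * total

doc = ["lucene", "search", "lucene", "index"]
other = ["cooking", "recipes"]
doc_freqs = {"lucene": 1, "search": 1}  # made-up document frequencies
print(score(["lucene", "search"], doc, doc_freqs, num_docs=2))
print(score(["lucene", "search"], other, doc_freqs, num_docs=2))
```

Under these assumptions the vector product is never formed explicitly and no angle threshold appears: a document is scored only if it shares at least one term with the query, and the normalization factors only rescale the result.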

On 9/5/05, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
>
> Hi Fredrik,
>
> Are you looking for org.apache.lucene.search.DefaultSimilarity ?
>
> Otis
RE: VSM in Lucene, again
Hi Fredrik,

I asked this question before, and Erik Hatcher gave me the link below:

http://www.lucenebook.com/blog/errata/scoring_formula_omission.html

It shows a formula which was not completely implemented.

Regards
Madhu

-----Original Message-----
From: Fredrik Andersson [mailto:fidde.andersson@gmail.com]
Sent: Monday, September 05, 2005 1:35 PM
To: general@lucene.apache.org
Subject: Re: VSM in Lucene, again

Hi Otis,

Yes, I have looked through that class thoroughly, but all I see is an
IDF-map lookup with boost functionality. The only thing that lets a query
return a document that does not contain the terms in the query is the
sloppyFreq function. It's more of a semantic trick based on edit distance,
so it has nothing to do with the vector angles in a regular vector space
model. The document terms still have to be semantically similar to the ones
in the query, which is not the case when matching by vector angles in a VSM
(though you often boost documents containing words from the query,
naturally).

Fredrik
