Mailing List Archive

search algorithm in Lucene
Hi all,
I am new to Lucene, but I am familiar with IR. I want to build an IR
system in Java and I found Lucene, but some questions remained
unanswered for me after searching the complete website.

I have a couple of questions regarding Lucene:

1. What search algorithm(s) [VSM, ...] are used or available in
Lucene?

2. How is term weight calculated in Lucene? How many types of
term-weighting formulas are implemented, and what are they?

Regards
Madhu
RE: search algorithm in Lucene [ In reply to ]
Hi, one more question.

Is there any text file format that Lucene expects, something like
the addition of XML tags to the text document that is given to Lucene
before indexing?

regards
madhu

-----Original Message-----
From: Madhu Panitini
Sent: Monday, August 08, 2005 7:02 PM
To: general@lucene.apache.org
Subject: search algorithm in Lucene

RE: search algorithm in Lucene [ In reply to ]
Hi Madhu,

> 1. What search algorithm(s) [VSM, ...] are used or available in
> Lucene?
Lucene uses the Vector Space Model as its retrieval model.
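
For readers less familiar with the Vector Space Model, here is a minimal sketch of the general idea in plain Java: documents and queries become sparse term vectors, and relevance is the cosine of the angle between them. This is not Lucene code and not Lucene's actual scoring. The class and method names are hypothetical, and the raw term-count weighting and whitespace tokenization are simplifying assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the Vector Space Model; not part of Lucene.
final class VectorSpaceSketch {

    /** Builds a sparse term vector of raw counts from lower-cased, whitespace-separated text. */
    static Map<String, Double> termVector(String text) {
        Map<String, Double> vector = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                vector.merge(term, 1.0, Double::sum);
            }
        }
        return vector;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> query = termVector("lucene search");
        System.out.println(cosine(query, termVector("lucene index")));
        System.out.println(cosine(query, termVector("java compiler")));
    }
}
```

A document that shares more terms with the query gets a cosine closer to 1, and a document with no terms in common gets 0.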

> 2. How is term weight calculated in Lucene? How many types of
> term-weighting formulas are implemented, and what are they?
TF-IDF weighting is used, with some modifications to how the raw score is
computed. You can refer to "Lucene in Action" by Otis Gospodnetic and
Erik Hatcher, page 78. Here is the formula:

score(q, d) = SUM over t in q of
    tf(t in d) * idf(t) * boost(t.field in d) * lengthNorm(t.field in d)

There are variations in how you can compute this score, for example
by setting different boost levels while indexing or searching. But the
basic scoring is still based on TF-IDF weighting.
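
To make the four factors concrete, here is a rough sketch in plain Java that evaluates the sum exactly as quoted in this message, over a tiny in-memory corpus. It does not use the Lucene API. The class and method names are hypothetical, and the concrete tf, idf, and lengthNorm functions (square root of the frequency, a logarithmic inverse document frequency, and one over the square root of the field length) are assumptions chosen to resemble common defaults, so check them against the Similarity implementation in your Lucene version.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the quoted formula; not Lucene's implementation.
final class TfIdfScoreSketch {

    /** Assumed term-frequency factor: dampens repeated occurrences. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Assumed inverse-document-frequency factor: rarer terms weigh more. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Assumed length normalization: shorter fields weigh more. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Sums tf * idf * boost * lengthNorm over the query terms found in the field. */
    static double score(List<String> queryTerms, List<String> fieldTerms,
                        List<List<String>> corpus, double boost) {
        if (fieldTerms.isEmpty()) {
            return 0.0;
        }
        double norm = lengthNorm(fieldTerms.size());
        double sum = 0.0;
        for (String term : queryTerms) {
            int freq = Collections.frequency(fieldTerms, term);
            if (freq == 0) {
                continue; // term absent from this document contributes nothing
            }
            int docFreq = 0;
            for (List<String> doc : corpus) {
                if (doc.contains(term)) {
                    docFreq++;
                }
            }
            sum += tf(freq) * idf(docFreq, corpus.size()) * boost * norm;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("lucene", "search", "search");
        List<String> d2 = Arrays.asList("lucene", "index");
        List<List<String>> corpus = Arrays.asList(d1, d2, Arrays.asList("java"));
        List<String> query = Arrays.asList("search");
        System.out.println("d1: " + score(query, d1, corpus, 1.0));
        System.out.println("d2: " + score(query, d2, corpus, 1.0));
    }
}
```

Raising the boost argument scales a field's contribution linearly, which is the kind of variation mentioned above; the TF-IDF core stays the same.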

Hope it helps.

Rajesh Munavalli

RE: search algorithm in Lucene [ In reply to ]
Lucene considers text documents only. If you use the standard analyzer,
all the contents of the document will be parsed the same way. To index
an XML document, you need to come up with your own Analyzer/Tokenizer
that separates the XML tags and indexes the content accordingly. I guess
you want to preserve the metadata contained in the XML document.

--
Rajesh Munavalli

-----Original Message-----
From: Madhu Panitini [mailto:Madhu.Panitini@pass-consulting.com]
Sent: Monday, August 08, 2005 12:17 PM
To: general@lucene.apache.org
Subject: RE: search algorithm in Lucene

Re: search algorithm in Lucene [ In reply to ]
On Aug 8, 2005, at 1:19 PM, Rajesh Munavalli wrote:
>> 2. How is term weight calculated in Lucene? How many types of
>> term-weighting formulas are implemented, and what are they?
> TF-IDF weighting is used, with some modifications to how the raw
> score is computed. You can refer to "Lucene in Action" by Otis
> Gospodnetic and Erik Hatcher, page 78. Here is the formula:
>
> score(q, d) = SUM over t in q of
>     tf(t in d) * idf(t) * boost(t.field in d) * lengthNorm(t.field in d)

Note that the most crucial erratum for Lucene in Action is on that
very formula. The corrected one is here:

http://www.lucenebook.com/blog/errata/scoring_formula_omission.html

Erik
RE: search algorithm in Lucene [ In reply to ]
If you need to index XML with Lucene, you can look at my article about
using Digester+Lucene to parse+index XML documents. The article can be
found on the IBM developerWorks site.
You can also look at the code that comes with Lucene in Action where we
show how to parse with Digester and SAX 2.0 API, and index with Lucene.
Chapter 7, I believe.
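
As a rough idea of the parsing half, here is a minimal sketch that uses only the SAX API bundled with the JDK to pull element text out of an XML document into named fields. It is not the code from the article or the book. The class name, the method name, and the one-field-per-element convention are assumptions made for illustration, and the Lucene indexing step is left out on purpose because the API for building documents differs between Lucene versions.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects the text of each leaf element as a named field.
final class XmlFieldExtractorSketch extends DefaultHandler {
    private final Map<String, String> fields = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // discard text that preceded this opening tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            // Repeated elements with the same name are joined with a space.
            fields.merge(qName, value, (a, b) -> a + " " + b);
        }
        text.setLength(0);
    }

    /** Parses the XML string and returns element name to text, in document order. */
    static Map<String, String> extract(String xml) {
        try {
            XmlFieldExtractorSketch handler = new XmlFieldExtractorSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Lucene in Action</title>"
                + "<body>Indexing and searching</body></doc>";
        System.out.println(extract(xml));
    }
}
```

Each entry in the returned map would then become one field of the document handed to Lucene, which keeps the metadata in the tags searchable separately from the body text.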

Otis

