Mailing List Archive

Re: Normalization of Documents
Hi,

These types of questions/discussions should be on the users list, not the
dev list, please.


Just for the record, Lucene's scoring is not as simple as just a percentage.
From the FAQ:

For the record, Lucene's scoring algorithm is, roughly:

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)

where:
score_d : score for document d
sum_t : sum for all terms t
tf_q : the square root of the frequency of t in the query
tf_d : the square root of the frequency of t in d
idf_t : log(numDocs/(docFreq_t+1)) + 1.0
numDocs : number of documents in index
docFreq_t : number of documents containing t
norm_q : sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t : square root of number of tokens in d in the same field as t

(I hope that's right!)

[Doug later added...]

Make that:

score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

where

boost_t : the user-specified boost for term t
coord_q_d : number of terms in both query and document / number of terms in query

The coordination factor gives an AND-like boost to documents that contain,
e.g., all three terms in a three word query over those that contain just two
of the words.
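
For example, with a three-word query:

coord_q_d = 2/3 for a document containing two of the query terms
coord_q_d = 3/3 for a document containing all three

so the first document's summed score is scaled down by a third relative to
the second.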



This may still not be what you want, though. You should be able to replace
the scoring mechanism with your own. The problem you might run into is that
fetching the document data (such as a date) will slow down your search speed
dramatically.
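
For example, one crude workaround is to search normally and then re-rank only
the top N hits afterwards. A rough sketch (the "date" field name, the
AgeRescorer class, and the one-year half-life are all made up for
illustration, not anything in Lucene):

import org.apache.lucene.document.DateField;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class AgeRescorer {
  // Re-rank the first n hits by age, assuming each document stored a
  // "date" field written with DateField.timeToString().
  public static float[] rescoreByAge(Hits hits, int n) throws java.io.IOException {
    int limit = Math.min(n, hits.length());
    float[] adjusted = new float[limit];
    long now = System.currentTimeMillis();
    for (int i = 0; i < limit; i++) {
      Document doc = hits.doc(i);  // reads the stored document: the slow part
      long time = DateField.stringToTime(doc.get("date"));
      double ageInDays = (now - time) / 86400000.0;
      // halve the score for every 365 days of age (an arbitrary decay)
      adjusted[i] = (float) (hits.score(i) * Math.pow(0.5, ageInDays / 365.0));
    }
    return adjusted;
  }
}

Even re-ranking only the top hits pays one stored-document read per hit,
which is exactly the cost that makes a fetch-everything approach so slow.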

Do you know of any solutions (academic or free) that provide this concept
extraction? I've heard of a group in the UK who worked on something like
this.

--Peter



On 4/11/02 6:51 AM, "Halácsy Péter" <halacsy.peter@axelero.com> wrote:

> Concept extraction is not an easy thing, and I don't think you can
> implement a language-, context-, and document-type-independent solution.
> Filtering only the important terms of a text (rather than indexing all the
> text, as modern full-text indexing systems do) is one of the most important
> areas of IR. A lot of projects have worked on this topic, but nowadays it's
> less important because we can index every term if we want (cheaper and
> faster disks, lots of CPU).
>
> I think that in Lucene the term's % of the document
> (NUMBER_OF_WORDS_IN_THE_DOCUMENT / NUMBER_OF_QUERY_TERM_OCCURRENCES) is
> overweighted in some cases. I would like to tune it if I could.
>
> Document scoring could provide a solution for me, and I think for Melissa
> as well. I think it's a very important feature of a modern IR system. For
> example, Melissa would use it to score documents based on link popularity
> (or impact factor/citation frequency). In my project I need to score
> documents by their length and their age (a more recent document is more
> valuable, and in my archive very old documents are as valuable as very new
> ones).
>
> peter
>
>> -----Original Message-----
>> From: Peter Carlson [mailto:carlson@bookandhammer.com]
>> Sent: Wednesday, April 10, 2002 5:17 PM
>> To: Lucene Developers List
>> Subject: Re: Normalization of Documents
>>
>>
>> I have noticed the same issue.
>>
>> From what I understand, this is both the way it should work and a
>> problem. Shorter documents which contain a given term should be more
>> relevant because more of the document is about that term (i.e., the term
>> makes up a greater % of the document). However, when documents are of
>> completely different sizes (e.g., 20 words vs. 2000 words), this
>> assumption doesn't hold up very well.
>>
>> One solution I've heard of is to extract the concepts of the documents,
>> then search on those. The concepts are still difficult to extract if the
>> document is too short, but it may provide a way to standardize documents.
>> I have been lazily looking for an open-source, academic concept extractor,
>> but I haven't found one. There are companies like Semio and
>> ActiveNavigation which provide this service for an expensive fee.
>>
>> Let me know if you find anything or have other ideas.
>>
>> --Peter
>>
>>
>> On 4/9/02 10:07 PM, "Melissa Mifsud" <melissamifsud@yahoo.com> wrote:
>>
>>> Hi,
>>>
>>> Documents which are shorter in length always seem to score higher in
>>> Lucene. I was under the impression that the normalization factors in the
>>> scoring function used by Lucene would improve this; however, after a
>>> couple of experiments, the short documents still always score the highest.
>>>
>>> Does anyone have any ideas as to how it is possible to make lengthier
>>> documents score higher?
>>>
>>> Also, I would like a way to boost documents according to the number of
>>> in-links a document has.
>>>
>>> Has anyone implemented a type of Document.setBoost() method?
>>>
>>> I found a thread in the lucene-dev mailing list where Doug Cutting
>>> mentions that this would be a great feature to add to Lucene. No one
>>> followed up on his email.
>>>
>>> Melissa.


RE: Normalization of Documents
> -----Original Message-----
> From: Peter Carlson [mailto:carlson@bookandhammer.com]
> Sent: Thursday, April 11, 2002 4:35 PM
> To: Lucene Users List
> Subject: Re: Normalization of Documents
>
>
> Hi,
>
> These types of questions/discussions should be on the users list, not the
> dev list, please.
>
OK

>
> Just for the record, Lucene's scoring is not as simple as just a percentage.
> From the FAQ:
>
> For the record, Lucene's scoring algorithm is, roughly:
>
> score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
What I would like:

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t) * p_value_d

where:
p_value_d : a predefined value for the document, calculated at indexing time (0 < p_value_d <= 1)
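
For example, with p_value_d = 0.45 the document keeps 45% of its tf*idf
score:

score_d' = score_d * 0.45

so it can only outrank an unscaled document whose raw score is below that.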

in the API:
option 1:
writer = new IndexWriter(..)
writer.addDocument(doc, 0.45);

option 2 (I think better)

Document d = new Document();
d.setValue(0.45);
d.add(..);
writer.addDocument(d);

peter



RE: Normalization of Documents
> From: Halácsy Péter
>
> What I would like:
>
> score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t) * p_value_d
>
> where:
> p_value_d : a predefined value for the document, calculated at indexing
> time (0 < p_value_d <= 1)
>
> in the API:
> option 1:
> writer = new IndexWriter(..)
> writer.addDocument(doc, 0.45);
>
> option 2 (I think better)
>
> Document d = new Document();
> d.setValue(0.45);
> d.add(..);
> writer.addDocument(d);

This would not be hard to add to Lucene. I would like to add it as soon as
we get the 1.2 release finalized.

I also prefer the second style of interface; however, the method should
probably be on Field, not on Document. Something like
Field.setBoost(float);
Perhaps we could also add a Document.setBoost(float) method which would
provide a default boost for all fields added to that document.
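
For example, usage might look like this (neither setBoost method exists yet;
Field.Text is just the existing factory):

Document doc = new Document();
doc.setBoost(0.5f);                     // proposed: default for all fields
doc.add(Field.Text("contents", "some body text"));
Field title = Field.Text("title", "Normalization of Documents");
title.setBoost(2.0f);                   // proposed: per-field override
doc.add(title);
writer.addDocument(doc);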

Doug

Re: Normalization of Documents
Hi Bernhard,

I think this is a very interesting issue.

I think that changing the scoring algorithm is one part of it; the other is
getting the information from the Document to use in the ranking. Since this
is an expensive operation, there will have to be an alternative approach.

Do you have any suggestions to start?

Thanks

--Peter

On 4/13/02 6:05 AM, "Bernhard Messer" <Bernhard.Messer@intrafind.de> wrote:


>
> The topic you are focusing on is a never-ending story in content
> retrieval in general. There is no perfect solution which fits every
> environment. Retrieving a document's context based on a single query
> term also seems to be very difficult. In Lucene it isn't very difficult
> to change the ranking algorithm. If you don't like the field
> normalization, you can comment out the following line in the TermScorer
> class:
>
> score *= Similarity.norm(norms[d]);
>
> If you put a comment around this line, your scoring is based on the
> term frequency alone.
>
> If more people are interested, we could think about a little more
> flexible ranking system within Lucene. There would be several parameters
> from the environment which could be used to rank a document. Therefore
> we would need an interface where we could change the Lucene document
> boost factor at runtime. For example, a document's ranking could be
> based on:
> links pointing to that document (like Google)
> last modification date,
> size of the document,
> term frequency,
> how often it was displayed to other users sending the same query
> terms to the system
> .....


RE: Normalization of Documents
> Therefore we would need an interface where we could change the Lucene
> document boost factor at runtime. For example, a document's ranking
> could be based on:
> links pointing to that document (like Google)
> last modification date,
> size of the document,
> term frequency,
> how often it was displayed to other users sending the same query
> terms to the system
> .....

Four of these five are based on a pre-calculated document
value/weight/score (I don't exactly understand what term frequency means in
this context). If I could assign a value to every document (as I proposed in
an earlier mail), we could start to implement algorithms to calculate the
different values (for example, calculating link popularity / PageRank needs
a matrix inversion that isn't too simple).
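
Just to sketch that calculation (a simplification: in practice PageRank-style
link popularity is computed by power iteration rather than by inverting the
matrix; links[i] lists the documents that document i points to):

public class LinkPopularity {
  // Returns one popularity value per document; scaled into (0, 1] it could
  // serve as the p_value_d I proposed earlier.
  public static double[] compute(int[][] links, int iterations, double damping) {
    int n = links.length;
    double[] rank = new double[n];
    java.util.Arrays.fill(rank, 1.0 / n);
    for (int iter = 0; iter < iterations; iter++) {
      double[] next = new double[n];
      java.util.Arrays.fill(next, (1.0 - damping) / n);
      for (int i = 0; i < n; i++) {
        if (links[i].length == 0) continue;  // dangling documents ignored here
        double share = damping * rank[i] / links[i].length;
        for (int j = 0; j < links[i].length; j++)
          next[links[i][j]] += share;
      }
      rank = next;
    }
    return rank;
  }
}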


> Let me know if you find that idea interesting, I would like
> to work on that topic.
I find it very interesting.

peter



RE: Normalization of Documents
> > Let me know if you find that idea interesting, I would like
> > to work on that topic.
> I find it very interesting.

Me too.

Otis


RE: Normalization of Documents
> -----Original Message-----
> From: apache@lucene.com [mailto:apache@lucene.com]
> Sent: Thursday, April 11, 2002 9:19 PM
> To: lucene-user@jakarta.apache.org
> Subject: RE: Normalization of Documents
>
>
> > From: Halácsy Péter
> >
> > option 2 (I think better)
> >
> > Document d = new Document();
> > d.setValue(0.45);
> > d.add(..);
> > writer.addDocument(d);
>
> This would not be hard to add to Lucene. I would like to add
> it as soon as
> we get the 1.2 release finalized.
>
> I also prefer the second style of interface; however, the method should
> probably be on Field, not on Document. Something like
> Field.setBoost(float);
> Perhaps we could also add a Document.setBoost(float) method which would
> provide a default boost for all fields added to that document.
>
> Doug

It sounds great! Thanks.

peter

Re: Normalization of Documents
> Let me know if you find that idea interesting, I would like to work on
> that topic.

Seeing as I brought the topic up... I'm interested!!

I've been doing a lot of research for my university thesis on IR and the type
of information that can be gathered from individual documents themselves and
from the structure of documents which are linked in some way (as hypertext).
This information can be used to change (add/filter) what exactly is indexed,
to change the ranking algorithm, and to boost documents. I'm sure I could put
my research to use!

Melissa




Re: Normalization of Documents
From Bernhard Messer:

> > Let me know if you find that idea interesting, I would like to work on
> > that topic.

Yup, me too. This is germane to my research as well.

Joshua

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

