Mailing List Archive

Normalization of Documents
Hi,

Documents which are shorter in length always seem to score higher in Lucene. I was under the impression that the normalization factors in the scoring function used by Lucene would correct for this; however, after a couple of experiments, the short documents still always score the highest.

Does anyone have any ideas on how to make lengthier documents score higher?

Also, I would like a way to boost documents according to the number of in-links a document has.

Has anyone implemented a type of Document.setBoost() method?

I found a thread in the lucene-dev mailing list where Doug Cutting mentions that this would be a great feature to add to Lucene. No one followed up on his email.

Melissa.
Re: Normalization of Documents
I have noticed the same issue.

From what I understand, this is both the way it should work and a problem.
A shorter document containing a given term should be more relevant because
more of the document is about that term (i.e., the term makes up a greater
percentage of the document). However, when the documents are of completely
different sizes (e.g., 20 words vs. 2000 words), this assumption doesn't
hold up very well.
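The bias can be seen directly in the arithmetic. Below is a rough sketch of the tf and length-norm factors in the shape Lucene's default scoring used around that time (idf and other factors are identical for both documents, so they are omitted; this is an illustration, not Lucene's actual code):

```java
// Why a 20-word document outscores a 2000-word one, even when the long
// document contains the term five times: tf grows as sqrt(freq), but the
// length norm shrinks as 1/sqrt(fieldLength).
public class LengthNormDemo {
    // term-frequency factor: sqrt(tf)
    static double tf(int freq) { return Math.sqrt(freq); }

    // length normalization: 1 / sqrt(number of terms in the field)
    static double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }

    static double score(int termFreq, int docLength) {
        return tf(termFreq) * lengthNorm(docLength);
    }

    public static void main(String[] args) {
        double shortDoc = score(1, 20);    // one hit in a 20-word document
        double longDoc  = score(5, 2000);  // five hits in a 2000-word document
        System.out.printf("short=%.3f long=%.3f%n", shortDoc, longDoc);
        // prints "short=0.224 long=0.050" -- the short document wins
    }
}
```

Under this model the long document would need roughly 100 occurrences of the term just to tie the short one, which matches the behavior Melissa observed.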

One solution I've heard of is to extract the concepts of the documents, then
search on those. Concepts are still difficult to extract if the document
is too short, but it may provide a way to standardize documents. I have been
lazily looking for an open-source, academic concept extractor, but I haven't
found one. There are companies like Semio and ActiveNavigation which provide
this service for a fee.

Let me know if you find anything or have other ideas.

--Peter


On 4/9/02 10:07 PM, "Melissa Mifsud" <melissamifsud@yahoo.com> wrote:



--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
RE: Normalization of Documents
Extracting concepts is not an easy task, and I don't think you can implement a solution that is independent of language, context, and document type. Filtering out only the important terms of a text (rather than indexing all of the text, as modern full-text indexing systems do) is one of the most important areas of IR. A lot of projects have worked on this topic, but nowadays it is less important because we can index every term if we want to (cheaper and faster disks, lots of CPU).

I think that in Lucene the term's percentage of the document (NUMBER_OF_QUERY_TERM_OCCURRENCES / NUMBER_OF_WORDS_IN_THE_DOCUMENT) is overweighted in some cases. I would like to tune it if I could.

Document-level scoring could provide a solution for me, and I think for Melissa as well. I think it's a very important feature of a modern IR system. For example, Melissa would use it to score documents based on link popularity (or impact factor / citation frequency). In my project I would score documents on their length and their age (a more recent document is more valuable, and in my archive very old documents are as valuable as very new ones).
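The age-based scoring described here (recent documents most valuable, very old archive documents as valuable as very new ones) amounts to a U-shaped boost curve. Below is one possible sketch of such a curve; the shape and constants are invented for illustration and are not taken from Lucene or any existing system:

```java
// U-shaped age boost: a recency term that decays with age, combined with
// an archival term that grows for very old documents. The boost is the
// larger of the two, so middle-aged documents get the lowest value.
public class AgeBoost {
    static double ageBoost(double ageInYears) {
        double recency  = Math.exp(-ageInYears);              // high for new docs
        double archival = 1.0 - Math.exp(-ageInYears / 50.0); // grows for very old docs
        return Math.max(recency, archival);
    }

    public static void main(String[] args) {
        System.out.println(ageBoost(0.1));   // recent: close to 1
        System.out.println(ageBoost(5.0));   // middle-aged: dips low
        System.out.println(ageBoost(100.0)); // very old: recovers
    }
}
```

The resulting multiplier could then be applied on top of the term-based score, once an interface for setting a per-document boost at runtime exists.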

peter

> -----Original Message-----
> From: Peter Carlson [mailto:carlson@bookandhammer.com]
> Sent: Wednesday, April 10, 2002 5:17 PM
> To: Lucene Developers List
> Subject: Re: Normalization of Documents
RE: Normalization of Documents
I have a related problem: I'm attempting to use Lucene as a duplicate-checking
mechanism. I've found that it is very difficult to get Lucene to give a decent
probability of duplication because of this type of weighting.

Scott

> -----Original Message-----
> From: Halácsy Péter [mailto:halacsy.peter@axelero.com]
> Sent: Thursday, April 11, 2002 8:52 AM
> To: Lucene Developers List
> Subject: RE: Normalization of Documents
Re: Normalization of Documents
Hi,

the topic you are focusing on is a never-ending story in content
retrieval in general. There is no perfect solution which fits every
environment. Retrieving a document's context based on a single query
term also seems to be very difficult. In Lucene, however, it isn't very
difficult to change the ranking algorithm. If you don't like the field
normalization, you can comment out the following line in the TermScorer
class.

score *= Similarity.norm(norms[d]);

If you comment out this line, your scoring is based on term frequency
alone.

If more people are interested, we could think about a somewhat more
flexible ranking system within Lucene. There would be several parameters
from the environment which could be used to rank a document. For that
we would need an interface through which we could change the Lucene
document boost factor at runtime. For example, a document's ranking
could be based on:
links pointing to that document (like Google),
last modification date,
size of the document,
term frequency,
how often it was displayed to other users who sent the same query
terms to the system,
.....
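Such a runtime document boost could combine several of these signals into one multiplier applied on top of the text score. The following is only an illustrative sketch; the signal names, curves, and weights are invented here and are not part of Lucene:

```java
// Hypothetical per-document boost combining two of the signals listed
// above: in-link count and recency. Each signal is damped so no single
// factor dominates the text-based score entirely.
public class DocumentBoost {
    // logarithmic damping: the 1000th in-link matters less than the 10th
    static double linkBoost(int inLinks) {
        return 1.0 + Math.log(1 + inLinks);
    }

    // newer documents get up to a 2x factor, decaying toward 1x with age
    static double recencyBoost(double ageInDays) {
        return 1.0 + Math.exp(-ageInDays / 365.0);
    }

    static double boost(int inLinks, double ageInDays) {
        return linkBoost(inLinks) * recencyBoost(ageInDays);
    }

    public static void main(String[] args) {
        double textScore = 0.42; // score from the term-based ranking
        // a month-old document with 10 in-links vs. an old, unlinked one
        System.out.println(textScore * boost(10, 30.0));
        System.out.println(textScore * boost(0, 400.0));
    }
}
```

The multiplication keeps the design simple: the text score still decides relevance to the query, while the boost only reorders documents of comparable textual relevance.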

Let me know if you find that idea interesting; I would like to work on
that topic.

--Bernhard



Peter Carlson wrote:




Re: Normalization of Documents
Bernhard,

I think your idea is very interesting and would be happy to help out.

Eric

On Sat, 13 Apr 2002, Bernhard Messer wrote:


Re: Normalization of Documents
> Let me know if you find that idea interesting; I would like to work on
> that topic.

Seeing as I brought the topic up... I'm interested!!

I've been doing a lot of research for my university thesis on IR, on the type
of information that can be gathered from individual documents themselves and
from the structure of documents which are linked in some way (as hypertext). This
information can be used to change (add/filter) what exactly is indexed, to
change the ranking algorithm, and to boost documents. I'm sure I could put
my research to use!

Melissa






Re: Normalization of Documents
Please have these discussions on the users list instead of the developers
list.

I know that it is sometimes difficult to know which list to use, since Lucene
is an API, but items that may be of general interest to Lucene users (who are
mostly developers) should go on the users list.

Thanks

--Peter

On 4/15/02 12:28 AM, "Melissa Mifsud" <melissamifsud@yahoo.com> wrote:


