Mailing List Archive

Lucene scoring: coord_q_d factor
Hello group,

The coord(q,d) normalisation is "a score factor based on how many of the query terms are found in the specified document" and is described here:

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_coord

Does this have a theoretical basis? On what basis was the decision made to include it? Does anybody know of a paper (in Information Retrieval, Information Seeking, etc.) or other, more general information about this?

Best Regards,
Karl

P.S.: This is my second question about Lucene scoring (current version). It relates to the question I posted about the older scoring version. I decided to repost because most people here seem not to have read it, presumably because it appeared to concern an old version, which it actually does not.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Lucene scoring: coord_q_d factor
Karl Koch wrote:
> The coord(q,d) normalisation is "a score factor based on how many of
> the query terms are found in the specified document" and is described
> here:
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_coord
>
> Does this have a theoretical basis? On what basis was the decision
> made to include it? Does anybody know of a paper (in Information
> Retrieval, Information Seeking, etc.) or other, more general
> information about this?

The following is quoted from: Krovetz, R. & Croft, W. B. (1992) Lexical
Ambiguity and Information Retrieval. ACM Transactions on Information
Systems, 10(2): 115-141.

Many retrieval systems represent documents and queries
by the words they contain, and base the comparison on
the number of words they have in common. The more
words the query and document have in common, the
higher the document is ranked; this is referred to as
a "coordination match." Performance is improved by
weighting query and document words using frequency
information from the collection and individual
document texts [27].

27. Salton, G. & McGill, M. Introduction to Modern Information
Retrieval. McGraw-Hill, New York, 1983.
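
As an illustration of the "coordination match" described in the quotation, here is a minimal Python sketch that ranks documents purely by how many distinct query words they share with the query. The toy documents and the helper name `coordination_level` are made up for this example; nothing here is Lucene code.

```python
def coordination_level(query_terms, doc_terms):
    """Number of distinct query terms that also occur in the document."""
    return len(set(query_terms) & set(doc_terms))


# Toy collection (hypothetical data, tokenised on whitespace).
docs = {
    "d1": "the blue kangaroo jumps".split(),
    "d2": "a blue sky".split(),
    "d3": "red fox".split(),
}
query = "blue kangaroo".split()

# Overlap only: no term frequency and no collection statistics are used.
levels = {name: coordination_level(query, terms) for name, terms in docs.items()}
ranking = sorted(docs, key=lambda name: levels[name], reverse=True)

print(levels)
print(ranking)
```

The weighting mentioned in the last sentence of the quotation would replace the plain count with a sum of per-term weights.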


Re: Lucene scoring: coord_q_d factor
Hello Steven,

I looked up the paper and read the relevant part. The passage you quoted is from the introduction. I believe it refers to the basic purpose of an information retrieval system in general, or at least to the purpose of a vector space model IR system.

If this is the theoretical justification of the coord_q_d normalisation, then it actually replicates the other part of the scoring formula to some degree. The entire formula is concerned with this: comparing the term frequencies of query and document.

Is there any other paper that actually shows the benefit of doing this particular normalisation with coord_q_d? I am not suggesting here that it is not useful, I am just looking for evidence of how the idea developed.

Karl




-------- Original Message --------
Date: Tue, 12 Dec 2006 10:01:05 -0500
From: Steven Rowe <sarowe@syr.edu>
To: java-user@lucene.apache.org
Subject: Re: Lucene scoring: coord_q_d factor

Re: Lucene scoring: coord_q_d factor
Karl Koch wrote:
> Is there any other paper that actually shows the benefit of doing
> this particular normalisation with coord_q_d? I am not suggesting
> here that it is not useful, I am just looking for evidence of how the
> idea developed.

I think it's a mischaracterization to call coordination a
"normalization". In my mind, "normalization" is something applied
equally to all documents' scores. The coordination component of a
document's score varies from document to document, and so doesn't meet
this criterion.

I repeat the citation of the book cited by the paper I cited :) :

>> Salton, G. & McGill, M. Introduction to Modern Information
>> Retrieval. McGraw-Hill, New York, 1983.

In addition to the above book, here are two other books that I've seen
cited as describing "coordination-level matching" (a.k.a. "overlap
ranking"):

Salton, G. (1968). Automatic information organization and retrieval.
New York: McGraw-Hill.

Lancaster, F.W. (1979). Information retrieval systems: Characteristics,
testing and evaluation (2nd ed.). New York: Wiley.

I don't know the answer to your larger question: why use a coordination
component in a similarity measure when other components (tf*idf) seem to
serve the same function? What you seem to be looking for is a study
that directly compares a system using a coordination component in its
similarity measure with the *same* system, varying the measure only in
that coordination is elided. Unfortunately, I know of no such study.

Good luck,
Steve


Re: Lucene scoring: coord_q_d factor
Hello Steven,

Unfortunately I don't have access to these books right now. I will try to get hold of them. Thank you for these pointers. :)

I had a quick look at "coordination level matching" on the web and found evidence that this seems to have been an early retrieval strategy. My main question is why one should use coordination level matching if one is already doing (proper) TFxIDF-based matching. When I look at Lucene's scoring formula, it seems to me that two kinds of matching are performed and combined in a single formula.

In the paper "Exploiting the Similarity of Non-matching Terms at Retrieval Time", which can be found here:

http://www.cis.strath.ac.uk/~fabioc/papers/00-jir.pdf

coordination level matching is compared directly with TFxIDF. To me, it seems that coordination level matching could be used if I don't want to use TFxIDF, but not together with it. In this context, I wonder what benefit coordination level matching has in combination with TFxIDF.

It is likely that I am misunderstanding something here. Perhaps with your help I can untangle that a bit further. As I said earlier, I am only looking for a reasonable explanation (perhaps backed by some evidence in the literature) of why it is used together with TFxIDF.

Thank you,
Karl



-------- Original Message --------
Date: Tue, 12 Dec 2006 17:15:48 -0500
From: Steven Rowe <sarowe@syr.edu>
To: java-user@lucene.apache.org
Subject: Re: Lucene scoring: coord_q_d factor

Re: Lucene scoring: coord_q_d factor
On 12/13/06, Karl Koch <TheRanger@gmx.net> wrote:
> To me, it seems that coordination level matching could be used if I don't want to use TFxIDF but not together with it. In this context, I wonder what benefit the "coordination level matching" has in combination with TFxIDF?

Well, if I search for blue kangaroo, the coord is nice to get
documents with "blue" and "kangaroo" to score higher than documents
with just one term. And among documents with just one term, the idf
factor will make "kangaroo" rank above "blue", which is generally
desired.

I have seen complaints about the default similarity though, where the
coord factor does not give enough of a boost in relation to the idf of
some of the individual terms.
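
A minimal sketch of the "blue kangaroo" behaviour follows. It is a simplification, not Lucene's actual scoring code: the score is taken to be coord * sum(sqrt(freq) * idf) over the matching terms, with coord as the fraction of query terms matched and idf = 1 + ln(numDocs / (docFreq + 1)). These shapes are assumptions modelled on Lucene's default similarity; query norm, field norms and boosts are left out, and the collection statistics are invented.

```python
import math


def idf(doc_freq, num_docs):
    # Assumed idf shape: rarer terms get larger weights.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query, term_freqs, df, num_docs, use_coord=True):
    """Simplified score: coord * sum over matching terms of sqrt(freq) * idf."""
    matched = [t for t in query if term_freqs.get(t, 0) > 0]
    total = sum(math.sqrt(term_freqs[t]) * idf(df[t], num_docs) for t in matched)
    coord = len(matched) / len(query) if use_coord else 1.0
    return coord * total


# Invented collection statistics: "blue" is common, "kangaroo" is rare.
num_docs = 1000
df = {"blue": 300, "kangaroo": 5}
query = ["blue", "kangaroo"]

# Hypothetical documents, given as term -> within-document frequency.
both = {"blue": 1, "kangaroo": 1}
only_kangaroo = {"kangaroo": 1}
only_blue = {"blue": 1}
kangaroo_x4 = {"kangaroo": 4}  # one query term only, but repeated

for name, doc in [("both", both), ("only_kangaroo", only_kangaroo),
                  ("only_blue", only_blue), ("kangaroo_x4", kangaroo_x4)]:
    without = score(query, doc, df, num_docs, use_coord=False)
    with_coord = score(query, doc, df, num_docs)
    print(f"{name:14s} no coord: {without:7.3f}   with coord: {with_coord:7.3f}")
```

With these made-up numbers, idf alone orders the one-term documents ("kangaroo" above "blue"), and the document that repeats "kangaroo" four times outscores the two-term document only when coord is switched off.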


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: Lucene scoring: coord_q_d factor
Do you know about any papers that discuss this?

Karl

-------- Original Message --------
Date: Wed, 13 Dec 2006 10:31:41 -0500
From: "Yonik Seeley" <yonik@apache.org>
To: java-user@lucene.apache.org
Subject: Re: Lucene scoring: coord_q_d factor

Re: Lucene scoring: coord_q_d factor
On Wednesday 13 December 2006 16:42, Karl Koch wrote:
> Do you know about any papers that discuss this?

Coordination is called "co-ordination" in the original idf paper by
K. Spärck Jones, "A statistical interpretation of term specificity
and its application in retrieval", Journal of Documentation 28,
11-21, 1972:
http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf

The paper is the first one on the idf page:
http://www.soi.city.ac.uk/~ser/idf.html

Regards,
Paul Elschot

Re: Lucene scoring: coord_q_d factor
Hello Paul,

Thank you for providing the link to that paper. I read it again, and you are right. I found the following passage:

"In normal term co-ordination matches, if a request and document have a frequent term in common, this counts for as much as a non-frequent one; so if a request and document share three common terms, the document is retrieved at the same level as another one sharing three rare terms with the request. But it seems we should treat matches on non-frequent terms as more valuable than ones on frequent terms, without disregarding the latter altogether. The natural solution is to correlate a term's matching value with its collection frequency."
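
The contrast drawn in this passage can be made concrete with a small sketch. It compares a plain co-ordination level (a count of shared terms) with a match weighted by collection frequency. The weight log(N / n) is a common textbook form of idf and is used here as an assumption; it is not claimed to be the exact weighting function from the paper, and the collection numbers and terms are invented.

```python
import math

N = 10_000  # invented collection size

# Invented document frequencies: three frequent and three rare terms.
doc_freq = {
    "system": 4000, "data": 3500, "method": 3000,
    "kangaroo": 12, "marsupial": 8, "wallaby": 5,
}

request = set(doc_freq)  # the request contains all six terms

doc_common = {"system", "data", "method"}        # shares three frequent terms
doc_rare = {"kangaroo", "marsupial", "wallaby"}  # shares three rare terms


def coordination_level(request_terms, doc_terms):
    # Every shared term counts 1, however frequent it is in the collection.
    return len(request_terms & doc_terms)


def weighted_match(request_terms, doc_terms):
    # Each shared term counts log(N / n): rare terms are worth more, while
    # frequent terms still contribute a small positive amount.
    return sum(math.log(N / doc_freq[t]) for t in request_terms & doc_terms)


for name, doc in [("doc_common", doc_common), ("doc_rare", doc_rare)]:
    print(name, coordination_level(request, doc), round(weighted_match(request, doc), 3))
```

Both documents sit at co-ordination level 3, but the weighted match separates them, and the frequent terms are not disregarded altogether.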

If I am not misunderstanding that extract, I would say it suggests combining coordination level matching with IDF. I am interested in your view and in the views of others reading this.

Are there any other papers that regard the combination of coordination level matching and TFxIDF as advantageous?

Cheers,
Karl

-------- Original Message --------
Date: Wed, 13 Dec 2006 21:00:45 +0100
From: Paul Elschot <paul.elschot@xs4all.nl>
To: java-user@lucene.apache.org
Subject: Re: Lucene scoring: coord_q_d factor

Re: Lucene scoring: coord_q_d factor
Hi,

But isn't "coord" + TFIDF pretty intuitive? Independently, they are both useful and contribute to the final score for the match.

Otis

----- Original Message ----
From: Karl Koch <TheRanger@gmx.net>
To: java-user@lucene.apache.org
Sent: Wednesday, December 13, 2006 8:35:55 PM
Subject: Re: Lucene scoring: coord_q_d factor

Re: Lucene scoring: coord_q_d factor
Karl Koch wrote:
> If I do not misunderstand that extract, I would say it suggests the combination of coordination level matching with IDF. I am interested in your view and those who read this?

I read the sentence
"The natural solution is to correlate a term's matching value with its
collection frequency."
in exactly that way: as combining coordination level matching with IDF.

The score for a document is the sum of the term weights w(tf, idf) over
the query terms it contains, so you already have a combination of
coordination level matching with IDF. Now suppose your query contains
three terms A, B and C. Two of them (A and B) occur frequently in the
collection; one (C) is very rare. Documents matching just C could then
score higher than documents containing A and B. To avoid this, you can
give coordination more influence by multiplying the sum of term weights
by the coordination as an additional factor.
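
To make the effect of the extra factor explicit, here is a short worked inequality. It assumes one particular form of coordination, namely the fraction of query terms matched (m/3 for a three-term query), and writes w_A, w_B, w_C for the term weights of A, B and C in the respective documents. Both the notation and the choice of m/3 are assumptions made for this sketch.

```latex
\begin{align*}
\text{without coordination:}\quad
  & s(D_{AB}) = w_A + w_B, \qquad s(D_C) = w_C, \\
  & D_C \text{ outranks } D_{AB} \iff w_C > w_A + w_B. \\[4pt]
\text{with coordination } \tfrac{m}{3}:\quad
  & s'(D_{AB}) = \tfrac{2}{3}\,(w_A + w_B), \qquad s'(D_C) = \tfrac{1}{3}\,w_C, \\
  & D_C \text{ outranks } D_{AB} \iff w_C > 2\,(w_A + w_B).
\end{align*}
```

Under these assumptions the factor doubles the weight that C needs before a document matching only C overtakes one matching A and B. That makes the unwanted ordering less likely, but does not rule it out when C is rare enough.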

Sören

Re: Lucene scoring: coord_q_d factor
I think I understand now. I also have evidence from literature. So I would say that my question is solved. :)

Thank you, Otis, and everybody else for contributing!
Karl

-------- Original Message --------
Date: Thu, 14 Dec 2006 09:40:31 +0100
From: Soeren Pekrul <soeren.pekrul@gmx.de>
To: java-user@lucene.apache.org
Subject: Re: Lucene scoring: coord_q_d factor

Re: Lucene scoring: coord_q_d factor
Soeren Pekrul wrote:
> The score for a document is the sum of the term weights w(tf, idf) over
> the query terms it contains, so you already have a combination of
> coordination level matching with IDF. Now suppose your query contains
> three terms A, B and C. Two of them (A and B) occur frequently in the
> collection; one (C) is very rare. Documents matching just C could then
> score higher than documents containing A and B. To avoid this, you can
> give coordination more influence by multiplying the sum of term weights
> by the coordination as an additional factor.

Addendum:
For the query Q(A, B, C) with
A: df++ (idf--)
B: df++ (idf--)
C: df-- (idf++)
the user would probably expect the following ranking:
1. D(A, B, C)
2. D(A, C), D(B, C)
3. D(A, B)
4. D(C)
5. D(A), D(B)
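
The expected ranking can be checked with a small sketch. It assumes score = coord * (sum of the weights of the matched terms), with coord equal to the fraction of query terms matched, and it uses invented weights (A = B = 1, C = 3) to stand for "df++ (idf--)" and "df-- (idf++)". Term frequency is ignored.

```python
from itertools import combinations

weights = {"A": 1.0, "B": 1.0, "C": 3.0}  # invented: C is the rare term
query = ["A", "B", "C"]


def score(doc_terms, use_coord=True):
    """Sum of matched term weights, optionally multiplied by coord."""
    total = sum(weights[t] for t in doc_terms)
    coord = len(doc_terms) / len(query) if use_coord else 1.0
    return coord * total


# Every non-empty combination of query terms a document could contain.
docs = [c for r in range(1, len(query) + 1) for c in combinations(query, r)]


def ranking(use_coord):
    """Group documents with equal scores, best group first."""
    groups = {}
    for d in docs:
        groups.setdefault(round(score(d, use_coord), 6), []).append("".join(d))
    return [sorted(groups[s]) for s in sorted(groups, reverse=True)]


print("with coord:   ", ranking(use_coord=True))
print("without coord:", ranking(use_coord=False))
```

With these weights the multiplied score reproduces the five-level ranking above, while the plain sum places D(C) above D(A, B).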

Sören

Re: Lucene scoring: coord_q_d factor
FYI: The Wiki has a fair number of resources on IR:
http://wiki.apache.org/jakarta-lucene/InformationRetrieval
(I have added a link to this conversation, which contains a lot of
useful information.)

Karl, if you are so inclined, please feel free to add any of the
references you have found helpful that aren't already on this page
(anyone can edit the Wiki with a login).

-Grant

On Dec 14, 2006, at 4:59 AM, Soeren Pekrul wrote:

> Soeren Pekrul wrote:
>> The score for a document is the sum of the term weights w(tf, idf)
>> over the query terms it contains, so you already have a combination
>> of coordination level matching with IDF. Now suppose your query
>> contains three terms A, B and C. Two of them (A and B) occur
>> frequently in the collection; one (C) is very rare. Documents
>> matching just C could then score higher than documents containing
>> A and B. To avoid this, you can give coordination more influence by
>> multiplying the sum of term weights by the coordination as an
>> additional factor.
>
> Addendum:
> For the query Q(A, B, C) with
> A: df++ (idf--)
> B: df++ (idf--)
> C: df-- (idf++)
> the user would probably expect the following ranking:
> 1. D(A, B, C)
> 2. D(A, C), D(B, C)
> 3. D(A, B)
> 4. D(C)
> 5. D(A), D(B)
>
> Sören

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at
http://wiki.apache.org/jakarta-lucene/LuceneFAQ



Re: Lucene scoring: coord_q_d factor
Karl Koch wrote:
> Are there any other papers that regard the combination of coordination level matching and TFxIDF as advantageous?

We independently developed coordination-level matching combined with
TFxIDF when I worked at Apple. This is documented in:

http://www.informatik.uni-trier.de/~ley/db/conf/trec/trec1996.html#RoseS96

(I had left Apple when this was written, but it largely describes work
done while I was there.)

Doug
