(1) The FAQ states the following:
"For the record, Lucene's scoring algorithm is, roughly:
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d
where:
score_d : score for document d
sum_t : sum for all terms t
tf_q : the square root of the frequency of t in the query
tf_d : the square root of the frequency of t in d
idf_t : log(numDocs/docFreq_t+1) + 1.0
numDocs : number of documents in index
docFreq_t : number of documents containing t
norm_q : sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t : square root of number of tokens in d in the same field as t
boost_t : the user-specified boost for term t
coord_q_d : number of terms in both query and document / number of
terms in query"
Is either of the expressions below the correct parenthesization of the
expression above? If not, what is?
score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) *
boost_t) * coord_q_d
score_d = sum_t((tf_q * idf_t) / (norm_q * tf_d * idf_t) / (norm_d_t *
boost_t)) * coord_q_d
(2) I'm trying to make sure that I have a handle on the semantics of
BooleanQuery, especially as they relate to the scoring mechanism. I would
appreciate it if someone would correct any misapprehensions in the
following descriptions.
* BooleanQuery.add(query, false, false) is equivalent to Boolean OR.
Only one such query must be satisfied in a given document D (via the
appearance of the associated term(s)/phrase(s) in D) in order for the
score for D to be > 0.
* BooleanQuery.add(query, true, false) is equivalent to Boolean AND. All
such queries must be satisfied (as above) in D in order for score(D) to
be > 0.
* BooleanQuery.add(query, false, true) is equivalent to Boolean NAND.
All such queries must be *not* satisfied in D in order for score(D) to be
> 0.
If these, and the semantics for "required" and "prohibited" (in
BooleanQuery.add()), are accurate, then the semantics seem rather odd to
me, so I'm hoping that someone will tell me that I'm wrong. :) In
particular, it seems to me that if you create a BooleanQuery and add a
single TermQuery tq to it with add(tq, false, false) then, according to
the semantics of "required" and "prohibited", *any* document will match
the query...which clearly doesn't make sense.
(3) Somewhat unrelated question: what are the semantics and purpose of
FilteredTermEnum.difference()? (I see where and how it's used in the
source but I don't understand the motivation.)
(4) I'm still somewhat puzzled by MultiTermQuery. I believe that Lucene
essentially uses the vector model to identify (and quantify)
query-document matching. A vector model query consists of one or more
terms, so I would expect MultiTermQuery to be a common query
type. However, the documentation doesn't make it clear how one binds
several terms together into a MultiTermQuery, and in particular I don't
see how one could separately set the boost for different terms in a
MultiTermQuery. In my current prototype system, I've been using
BooleanQuery to bind my terms together into a single query, and I'm not
sure that the semantics for BooleanQuery is what I want (hence question
(2) above). Could someone please explain what MultiTermQuery is for,
how it should be used, etc.?
Thanks--
Joshua O'Madadhain
jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
"For the record, Lucene's scoring algorithm is, roughly:
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d
where:
score_d : score for document d
sum_t : sum for all terms t
tf_q : the square root of the frequency of t in the query
tf_d : the square root of the frequency of t in d
idf_t : log(numDocs/docFreq_t+1) + 1.0
numDocs : number of documents in index
docFreq_t : number of documents containing t
norm_q : sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t : square root of number of tokens in d in the same field as t
boost_t : the user-specified boost for term t
coord_q_d : number of terms in both query and document / number of
terms in query"
Is either of the expressions below the correct parenthesization of the
expression above? If not, what is?
score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) *
boost_t) * coord_q_d
score_d = sum_t((tf_q * idf_t) / (norm_q * tf_d * idf_t) / (norm_d_t *
boost_t)) * coord_q_d
(2) I'm trying to make sure that I have a handle on the semantics of
BooleanQuery, especially as they relate to the scoring mechanism. I would
appreciate it if someone would correct any misapprehensions in the
following descriptions.
* BooleanQuery.add(query, false, false) is equivalent to Boolean OR.
Only one such query must be satisfied in a given document D (via the
appearance of the associated term(s)/phrase(s) in D) in order for the
score for D to be > 0.
* BooleanQuery.add(query, true, false) is equivalent to Boolean AND. All
such queries must be satisfied (as above) in D in order for score(D) to
be > 0.
* BooleanQuery.add(query, false, true) is equivalent to Boolean NAND.
All such queries must be *not* satisfied in D in order for score(D) to be
> 0.
If these, and the semantics for "required" and "prohibited" (in
BooleanQuery.add()), are accurate, then the semantics seem rather odd to
me, so I'm hoping that someone will tell me that I'm wrong. :) In
particular, it seems to me that if you create a BooleanQuery and add a
single TermQuery tq to it with add(tq, false, false) then, according to
the semantics of "required" and "prohibited", *any* document will match
the query...which clearly doesn't make sense.
(3) Somewhat unrelated question: what are the semantics and purpose of
FilteredTermEnum.difference()? (I see where and how it's used in the
source but I don't understand the motivation.)
(4) I'm still somewhat puzzled by MultiTermQuery. I believe that Lucene
essentially uses the vector model to identify (and quantify)
query-document matching. A vector model query consists of one or more
terms, so I would expect MultiTermQuery to be a common query
type. However, the documentation doesn't make it clear how one binds
several terms together into a MultiTermQuery, and in particular I don't
see how one could separately set the boost for different terms in a
MultiTermQuery. In my current prototype system, I've been using
BooleanQuery to bind my terms together into a single query, and I'm not
sure that the semantics for BooleanQuery is what I want (hence question
(2) above). Could someone please explain what MultiTermQuery is for,
how it should be used, etc.?
Thanks--
Joshua O'Madadhain
jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>