Mailing List Archive: Qs re: document scoring and semantics

(1) The FAQ states the following:

"For the record, Lucene's scoring algorithm is, roughly:

score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d

where:
score_d : score for document d
sum_t : sum for all terms t
tf_q : the square root of the frequency of t in the query
tf_d : the square root of the frequency of t in d
idf_t : log(numDocs/docFreq_t+1) + 1.0
numDocs : number of documents in index
docFreq_t : number of documents containing t
norm_q : sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t : square root of number of tokens in d in the same field as t
boost_t : the user-specified boost for term t
coord_q_d : number of terms in both query and document / number of
terms in query"

Is either of the expressions below the correct parenthesization of the
expression above? If not, what is?

score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) *
boost_t) * coord_q_d

score_d = sum_t((tf_q * idf_t) / (norm_q * tf_d * idf_t) / (norm_d_t *
boost_t)) * coord_q_d

(2) I'm trying to make sure that I have a handle on the semantics of
BooleanQuery, especially as they relate to the scoring mechanism. I would
appreciate it if someone would correct any misapprehensions in the
following descriptions.

* BooleanQuery.add(query, false, false) is equivalent to Boolean OR.
Only one such query must be satisfied in a given document D (via the
appearance of the associated term(s)/phrase(s) in D) in order for the
score for D to be > 0.

* BooleanQuery.add(query, true, false) is equivalent to Boolean AND. All
such queries must be satisfied (as above) in D in order for score(D) to
be > 0.

* BooleanQuery.add(query, false, true) is equivalent to Boolean NAND.
All such queries must be *not* satisfied in D in order for score(D) to be
> 0.

If these, and the semantics for "required" and "prohibited" (in
BooleanQuery.add()), are accurate, then the semantics seem rather odd to
me, so I'm hoping that someone will tell me that I'm wrong. :) In
particular, it seems to me that if you create a BooleanQuery and add a
single TermQuery tq to it with add(tq, false, false) then, according to
the semantics of "required" and "prohibited", *any* document will match
the query...which clearly doesn't make sense.

(3) Somewhat unrelated question: what are the semantics and purpose of
FilteredTermEnum.difference()? (I see where and how it's used in the
source but I don't understand the motivation.)

(4) I'm still somewhat puzzled by MultiTermQuery. I believe that Lucene
essentially uses the vector model to identify (and quantify)
query-document matching. A vector model query consists of one or more
terms, so I would expect MultiTermQuery to be a common query
type. However, the documentation doesn't make it clear how one binds
several terms together into a MultiTermQuery, and in particular I don't
see how one could separately set the boost for different terms in a
MultiTermQuery. In my current prototype system, I've been using
BooleanQuery to bind my terms together into a single query, and I'm not
sure that the semantics for BooleanQuery is what I want (hence question
(2) above). Could someone please explain what MultiTermQuery is for,
how it should be used, etc.?

Thanks--

Joshua O'Madadhain

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

> From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
>
> Is either of the expressions below the correct parenthesization of the
> expression above? If not, what is?
>
> score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) *
> boost_t) * coord_q_d

That's correct. The tf*idf weights are normalized for document length. I
would parenthesize it:
((tf_q * idf_t) / norm_q) * ((tf_d * idf_t) / norm_d_t) * boost_t

> (2) I'm trying to make sure that I have a handle on the semantics of
> BooleanQuery [ ... ]
>
> * BooleanQuery.add(query, false, false) is equivalent to Boolean OR.
> * BooleanQuery.add(query, true, false) is equivalent to
> Boolean AND. All
> * BooleanQuery.add(query, false, true) is equivalent to
> Boolean NAND.

That's more-or-less correct.

To be precise: a binary boolean OR is implemented by:
BooleanQuery query = new BooleanQuery();
BooleanQuery.add(clause1, false, false);
BooleanQuery.add(clause2, false, false);

A binary boolean AND is implemented by:
BooleanQuery query = new BooleanQuery();
BooleanQuery.add(clause1, true, false);
BooleanQuery.add(clause2, true, false);

A binary boolean NAND is implemented by either:
BooleanQuery query = new BooleanQuery();
BooleanQuery.add(clause1, true, false);
BooleanQuery.add(clause2, false, true);
or
BooleanQuery query = new BooleanQuery();
BooleanQuery.add(clause1, false, false);
BooleanQuery.add(clause2, false, true);

> If these, and the semantics for "required" and "prohibited" (in
> BooleanQuery.add()), are accurate, then the semantics seem
> rather odd to
> me, so I'm hoping that someone will tell me that I'm wrong. :) In
> particular, it seems to me that if you create a BooleanQuery and add a
> single TermQuery tq to it with add(tq, false, false) then,
> according to
> the semantics of "required" and "prohibited", *any* document
> will match
> the query...which clearly doesn't make sense.

That is not the case. Such a query will return all documents containing the
term. This is equivalent to a unary OR. A document which does not contain
some query term is never returned.

> (3) Somewhat unrelated question: what are the semantics and purpose of
> FilteredTermEnum.difference()? (I see where and how it's used in the
> source but I don't understand the motivation.)

I did not implement this and cannot speak for it, but it appears to be
unused.

> (4) I'm still somewhat puzzled by MultiTermQuery.
> Could someone please explain what MultiTermQuery
> is for,
> how it should be used, etc.?

MultiTermQuery is a base class used to implement FuzzyQuery and
WildcardQuery. These queries generate sets of terms by enumerating terms
from the index and filtering. Ideally it would not be public, as it is not
intended for end users.

Doug

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>