Mailing List Archive

question about TermQuery
I'm looking through the TermQuery code (and generally trying to
understand exactly how the searching works) and I found this code that
looks suspicious to me. It is very likely that I just don't understand
what's going on, but there is a chance that this is a bug, so I wanted
to ask for clarification / review from Doug and others.

In TermQuery.normalize(float norm), weight is multiplied first
by the normalization factor (the argument) and then by the idf that was
stored in the TermQuery before. Although I can't say for sure that this
is wrong, it does look suspect. First, idf is already factored into
weight in the sumOfSquaredWeights() method, and second, if normalize is
called multiple times, idf will be multiplied into weight over and
over... Plus the comment in normalize doesn't really make sense, and the
way the code is written makes me think that this is a problem caused by
a CVS merge conflict, and that only the line "weight *= norm" should be
in that method. Am I right?

======================================================
final float sumOfSquaredWeights(Searcher searcher) throws IOException {
  idf = Similarity.idf(term, searcher);
  weight = idf * boost;
  return weight * weight;   // square term weights
}

final void normalize(float norm) {
  weight *= norm;           // normalize for query
  weight *= idf;            // factor from document
}
======================================================
RE: question about TermQuery
It is correct to include idf twice. Recall that the weighting is roughly:
  (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d)

The TermQuery.weight field has all of this that is not document specific,
i.e., everything in this except tf_d and norm_d, so weight should be:
  (tf_q * idf_t / norm_q) * idf_t

The code is a little different, since we don't calculate tf_q, the frequency
of the term in the query, assuming that it is one, and instead use a 'boost'
factor. But the term's idf (idf_t) should really be in there twice.

The query normalization factor, 1/norm_q, is calculated based on the value
of the sumOfSquaredWeights, and is passed back in through the normalize()
call.
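
As a rough illustration of the arithmetic above, here is a minimal, self-contained sketch. It is not Lucene code: the class name WeightSketch, the helper queryWeight, and the sample idf and boost values are all made up for the example, and tf_q is taken to be 1 with boost standing in for it, as described above.

```java
public class WeightSketch {
    // Hypothetical helper: the document-independent part of the score for one
    // term, (boost * idf / norm_q) * idf, with norm_q = sqrt(sumOfSquaredWeights).
    static float queryWeight(float idf, float boost, float sumOfSquaredWeights) {
        float queryNorm = 1.0f / (float) Math.sqrt(sumOfSquaredWeights);
        float weight = idf * boost; // first idf factor, the query side
        weight *= queryNorm;        // normalize for query
        weight *= idf;              // second idf factor, the document side
        return weight;
    }

    public static void main(String[] args) {
        float idf = 2.0f;   // made-up value
        float boost = 1.0f; // made-up value
        // Single-term query: the sum of squared weights has just one entry.
        float sum = (idf * boost) * (idf * boost);
        System.out.println(queryWeight(idf, boost, sum)); // prints 2.0
    }
}
```

In this single-term case the query-side factor normalizes to 1, so the stored weight ends up equal to the idf. With more terms in the query, the sum passed in would cover all of them and each term's weight would be scaled down accordingly.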

normalize() is only called once per call to sumOfSquaredWeights(). These
calls are initiated on a Query in the method Query.scorer():

  static Scorer scorer(Query query, Searcher searcher, IndexReader reader)
      throws IOException {
    query.prepare(reader);
    float sum = query.sumOfSquaredWeights(searcher);
    float norm = 1.0f / (float)Math.sqrt(sum);
    query.normalize(norm);
    return query.scorer(reader);
  }

So it all looks okay to me.

Doug

> -----Original Message-----
> From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
> Sent: Sunday, October 07, 2001 2:19 PM
> To: lucene-dev@jakarta.apache.org
> Subject: question about TermQuery
Re: question about TermQuery
Great. Thanks for checking. I'm glad that this was a false alarm.
It also seems that because sumOfSquaredWeights reassigns all variables
again, the TermQuery instances can be reused in subsequent queries,
although they can't be used concurrently for multiple queries. Are Query
objects generally suitable for reuse? So, for example, could they be
used as keys for caching query results? My guess is that they can, as
long as they are not used for executing the query.
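
To make the reuse point concrete, here is a small stand-alone sketch that mimics the two methods quoted earlier in the thread. ToyTermQuery and prepare are hypothetical stand-ins, not the real Lucene classes: the idf is passed in directly instead of being computed from a Searcher. Under those assumptions, repeating the full sumOfSquaredWeights/normalize pair gives the same weight each time, while an extra normalize call without a reset does not.

```java
public class ReuseSketch {
    // Hypothetical stand-in for TermQuery; the idf is supplied by the caller.
    static final class ToyTermQuery {
        final float boost;
        float idf;
        float weight;

        ToyTermQuery(float boost) { this.boost = boost; }

        float sumOfSquaredWeights(float idfFromIndex) {
            idf = idfFromIndex;   // reassigned on every call
            weight = idf * boost; // reassigned, so earlier state is discarded
            return weight * weight;
        }

        void normalize(float norm) {
            weight *= norm;       // normalize for query
            weight *= idf;        // factor from document
        }
    }

    // Mirrors the call order shown in Query.scorer() for a single-term query.
    static float prepare(ToyTermQuery q, float idfFromIndex) {
        float sum = q.sumOfSquaredWeights(idfFromIndex);
        float norm = 1.0f / (float) Math.sqrt(sum);
        q.normalize(norm);
        return q.weight;
    }

    public static void main(String[] args) {
        ToyTermQuery q = new ToyTermQuery(1.0f);
        float first = prepare(q, 2.0f);
        float second = prepare(q, 2.0f);       // sequential reuse of the same instance
        System.out.println(first == second);   // prints true

        q.normalize(1.0f);                     // extra normalize with no reset in between
        System.out.println(q.weight == first); // prints false
    }
}
```

The mutable idf and weight fields are what make sequential reuse work and interleaved use risky: a second search that resets them between another search's sumOfSquaredWeights and normalize calls would change the first search's result.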
