Mailing List Archive

Deeper Ranking Issues in Lucene
I’ve been working on ranking/scoring issues for full-text search for
years. When I try to implement different ranking inside the Lucene engine,
I run into some problems. I list some of my experiences and questions here.

The major task I want to accomplish inside Lucene is implementing different
ranking/scoring algorithms. Perhaps I have not found the right source of
information, but I really cannot find a detailed document on the web
describing the relations among the ranking/scoring-related classes in
Lucene. “Lucene in Action” is mostly concerned with application-level
usage, not these lower-level APIs.

(1) The first thing I tried is the abstract class Scorer. The description
for this class is:
/** Expert: Implements scoring for a class of queries. */

If you look at the IndexSearcher class, one possible search process is
implemented inside
public TopDocs search(Query query, Filter filter, final int nDocs). If you
look in detail at how it is implemented, you will find that it first
acquires such a scorer instance by:
Scorer scorer = query.weight(this).scorer(reader);

The magic happens inside the method call scorer.score(new HitCollector()).
What the score method does is similar to an Iterator: scorer.score
repeatedly calls scorer.next() to advance to the next qualifying document
(I guess the qualifying documents in this iterator come from BitSet
operations derived from the given query; I did not look into the detailed
implementation of that), and calls scorer.score() to get the score for the
document the iterator currently points at.

What the HitCollector does is quite simple. It only implements a method
called collect(int doc, float score), which is called every time a new
document’s score is calculated. In the IndexSearcher.search method, the
documents and their scores are fed into a PriorityQueue and ranked
according to their scores.
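To make this iteration contract concrete, here is a minimal sketch. These are hypothetical stand-in classes, not the real Lucene ones: score(HitCollector) walks matching documents via next()/doc()/score() and hands each (doc, score) pair to collect(), exactly the shape of the loop described above.

```java
import java.util.ArrayList;
import java.util.List;

public class ScorerSketch {

    // Stand-in for Lucene's HitCollector: receives each scored document.
    interface HitCollector {
        void collect(int doc, float score);
    }

    // Stand-in for a Scorer, iterating a precomputed list of matches.
    static class SimpleScorer {
        private final int[] docs;
        private final float[] scores;
        private int pos = -1;

        SimpleScorer(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        boolean next() { return ++pos < docs.length; } // advance iterator
        int doc()      { return docs[pos]; }           // current doc id
        float score()  { return scores[pos]; }         // current doc's score

        // The loop the search drives: iterate matches, collect each one.
        void score(HitCollector hc) {
            while (next()) {
                hc.collect(doc(), score());
            }
        }
    }

    public static void main(String[] args) {
        List<String> hits = new ArrayList<>();
        SimpleScorer scorer =
            new SimpleScorer(new int[]{3, 7, 12}, new float[]{0.9f, 0.4f, 0.7f});
        scorer.score((doc, score) -> hits.add(doc + ":" + score));
        System.out.println(hits); // every matching doc collected in order
    }
}
```

In the real IndexSearcher the collector would push each pair into the PriorityQueue instead of a list, keeping only the top nDocs.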

(2) My first plan was to modify the scorer.score() method. Unfortunately, I
found this is extremely complex. score() is an abstract method implemented
inside subclasses such as BooleanScorer, ConjunctionScorer, PhraseScorer,
… Since I do not need to handle the full Boolean query syntax (in my
experiments, a query is defined as a list of terms connected by
disjunctions), I implemented the score method inside Scorer itself rather
than inside the subclasses.

What I did is, every time Scorer.score is called, pass the current
document number via doc() and read out the term frequencies via the
IndexReader.getTermFreqVector(int docNumber, String field) method. I
found this operation is extremely slow; the major cost is spent in
getTermFreqVector.

(3) Later I noticed that in the implementation of TermScorer there is a
function call
termDocs.read(docs, freqs); // refill buffer
(termDocs is obtained from the IndexReader). The comment for this function
says:

/** Attempts to read multiple entries from the enumeration, up to length of
 * <i>docs</i>. Document numbers are stored in <i>docs</i>, and term
 * frequencies are stored in <i>freqs</i>. The <i>freqs</i> array must be as
 * long as the <i>docs</i> array.
 *
 * <p>Returns the number of entries read. Zero is only returned when the
 * stream has been exhausted. */

So, unlike IndexReader.getTermFreqVector, which reads out the
term-frequency vector for a document, this function reads the
document/frequency pairs for a term. I find this method call dramatically
faster: at least two orders of magnitude faster than getTermFreqVector
when I want to get all term frequencies for the given terms across all
docs. I do not know the reason for this difference; I cannot find a
document describing the difference in implementation and purpose between
the two.
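The term-oriented access pattern can be sketched roughly like this. These are hypothetical classes mimicking the API shape, not Lucene’s real ones: a postings list maps one term to parallel doc/freq arrays, and read(docs, freqs) copies the next chunk into caller-supplied buffers, the way the buffer-refill call above does.

```java
public class PostingsSketch {

    // Postings for one term: parallel arrays of doc ids and term freqs,
    // read sequentially in chunks -- one pass per term, not one lookup
    // per document, which is why this access path is so much faster.
    static class TermDocs {
        private final int[] allDocs;
        private final int[] allFreqs;
        private int pos = 0;

        TermDocs(int[] docs, int[] freqs) {
            this.allDocs = docs;
            this.allFreqs = freqs;
        }

        // Copy up to docs.length entries into the buffers; return the
        // number read, or 0 once the postings are exhausted.
        int read(int[] docs, int[] freqs) {
            int n = Math.min(docs.length, allDocs.length - pos);
            System.arraycopy(allDocs, pos, docs, 0, n);
            System.arraycopy(allFreqs, pos, freqs, 0, n);
            pos += n;
            return n;
        }
    }

    public static void main(String[] args) {
        TermDocs td = new TermDocs(new int[]{0, 2, 5, 9}, new int[]{3, 1, 4, 2});
        int[] docs = new int[3], freqs = new int[3];
        int n;
        while ((n = td.read(docs, freqs)) > 0) { // refill buffer
            for (int i = 0; i < n; i++) {
                System.out.println("doc=" + docs[i] + " freq=" + freqs[i]);
            }
        }
    }
}
```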

(4) About extending Lucene’s scoring function.
If we want to implement an arbitrary ranking/scoring for
term-frequency-based algorithms, we need the scoring to fit the following
framework.

Any term-frequency scoring can be expressed in the following way (let’s
simplify by assuming there is only one field, so that we can ignore boost
factors for now):

Sigma (t in q) [TermWeight(t in d) * TermWeight(t in q) / lengthNorm(d) / lengthNorm(q)]

Since lengthNorm(q) is the same for every document, it does not affect
ranking, so we just ignore it. We further separate the term weight into
three parts: term weight related to the document, term weight related to
the query, and term weight related to the corpus (related to neither
document nor query), plus the document length norm:

Sigma (t in q) [TW(t in d)*TW(t in q) * TW (t) / lengthNorm(d)]

We can see this matches Lucene’s ranking function:

Sigma (t in q) [tf(t in d) * 1 * idf(t) / lengthNorm(d)]

To save computational cost, lengthNorm(d) is pre-calculated when indexing
the corpus. So Lucene’s lengthNorm(d) does not involve any corpus
statistics; it is based merely on the number of terms inside the document.
On the other hand, Lucene treats the term weight in the query as simply 1;
that is, it does not differentiate term importance inside the query. If we
want to emphasize a term twice, we need to put the term into the query
twice: for example, instead of searching for “boat” we search for
“boat boat”. Do I understand this correctly?
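A small numeric check of this reading (illustrative numbers and a hypothetical helper, not Lucene’s actual Similarity): with the query-term weight fixed at 1, repeating a term in the query is the only way to scale its contribution, so “boat boat” simply adds the per-term score twice.

```java
public class QueryTermWeightSketch {

    // Sigma over query terms of tf(t in d) * idf(t) / lengthNorm(d),
    // with the implicit query-term weight of 1.
    static float score(float[] tf, float[] idf, float lengthNorm) {
        float s = 0f;
        for (int i = 0; i < tf.length; i++) {
            s += tf[i] * 1f * idf[i] / lengthNorm;
        }
        return s;
    }

    public static void main(String[] args) {
        float tfBoat = 2f, idfBoat = 1.5f, norm = 10f;
        // Query "boat" vs query "boat boat": the duplicated term appears
        // as a second entry in the sum with identical tf and idf.
        float once  = score(new float[]{tfBoat}, new float[]{idfBoat}, norm);
        float twice = score(new float[]{tfBoat, tfBoat},
                            new float[]{idfBoat, idfBoat}, norm);
        System.out.println(once + " vs " + twice); // twice == 2 * once
    }
}
```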

So, generally, if we can update lengthNorm(d) inside the indexing code, or
via post-processing after all documents are indexed, Lucene can implement
an arbitrary ranking function such as BM25. A pity is that it does not
directly support query term weights in its query language.

(5) My final big question is how Lucene really implements
ranking/scoring. We can see there are two possible strategies, and each
results in different flexibility if we need to modify the current ranking
algorithm.

The first strategy is that Lucene generates document scores in a
document-by-document manner. In Scorer.score, the framework is that each
time the scorer meets a new document, Lucene generates a score for that
document. This framework is simple and intuitive, and everything we
discussed in (4) fits it: when processing the current document,
lengthNorm(d) can be read, and even if TW(t in d) depends on lengthNorm(d)
we can calculate it accordingly. Unfortunately, I cannot find any
low-level code that seems related to this implementation strategy. This
makes me think the Scorer.score method is not the real place where Lucene
implements its ranking. I am pretty confused about this part. Who can help
me with this?

The second strategy is that Lucene generates document scores in a
term-by-term manner. For each term in the query, Lucene calls
termDocs.read(docs, freqs) to accumulate scores along that term’s
dimension over all the documents. Under this framework (which I believe is
Lucene’s current implementation), what we discussed in (4) does not hold.
We would need to extend Lucene’s tf(float freq) function to
tf(float freq, float norm) so that arbitrary ranking could be implemented.
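The term-at-a-time strategy can be sketched as follows (hypothetical code, not Lucene’s): for each query term, walk its postings once and accumulate partial scores into a per-document array, applying the per-document norm from a precomputed table. This also shows why a tf(freq) function alone cannot see the norm unless the API passes it in, as in the tf(freq, norm) extension suggested above.

```java
public class TermAtATimeSketch {

    // scores[doc] += tf * idf / norm[doc], one term's postings at a time.
    static float[] scoreAll(int numDocs,
                            int[][] postingDocs,  // per term: matching doc ids
                            int[][] postingFreqs, // per term: term freqs
                            float[] idf,          // per term: corpus weight
                            float[] norm) {       // per doc: length norm
        float[] scores = new float[numDocs];
        for (int t = 0; t < postingDocs.length; t++) {       // term by term
            for (int i = 0; i < postingDocs[t].length; i++) {
                int d = postingDocs[t][i];
                scores[d] += postingFreqs[t][i] * idf[t] / norm[d];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Two query terms, three documents.
        int[][] docs  = { {0, 2}, {1, 2} };
        int[][] freqs = { {3, 1}, {2, 2} };
        float[] idf   = { 1.0f, 2.0f };
        float[] norm  = { 1.0f, 1.0f, 2.0f };
        float[] s = scoreAll(3, docs, freqs, idf, norm);
        for (float v : s) System.out.println(v);
    }
}
```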

(6) I have implemented a search module outside Lucene’s IndexSearcher
which can apply an arbitrary ranking over a vector query (a list of
disjunctive terms). The term-by-term implementation is much faster than my
previous document-by-document implementation. But I still cannot plug the
ranking module into Lucene’s Boolean engine, that is, rank only the
documents retrieved (only the documents satisfying the condition specified
by the query).

__________________________________________________________________

I do not know whether I have described my points clearly; this is complex
and hard to express in a short plain-text message. Maybe I should post a
PDF tech report online so that the problem is stated more clearly. I am
not sure my understanding is correct. Thanks for any help and comments.


Xiangyu Jin
Department of Computer Science
University of Virginia
Charlottesville, VA 22903
Re: Deeper Ranking Issues in Lucene
Hi,

I'm not sure anyone can fully address all your questions, but I
thought I would point you at
http://lucene.apache.org/java/docs/scoring.html if you haven't already
started there. It has some details on scoring, as well as pointers into
lower-level details.

Some comments are also inline below, which are just a few thoughts on
some of your questions, but nothing in-depth like I think you are
looking for. I am not able to answer in more detail at this time,
but can give you some pointers on where to look.

Hope it helps.

-Grant

On Apr 2, 2007, at 3:35 PM, xj3a@cs.virginia.edu wrote:

> [...] What I did is, every time Scorer.score is called, pass the
> current document number via doc() and read out the term frequencies
> via the IndexReader.getTermFreqVector(int docNumber, String field)
> method. I found this operation is extremely slow; the major cost is
> spent in getTermFreqVector.

Term Vectors are stored differently from TermDocs, so I am not
surprised that they are slower. There probably could be some
optimizations made to their storage, but I would guess that most
people don't use Term Vectors all that much, so no one has looked too
closely at where optimizations could be made. They are almost certainly
not meant for looping over all the docs in the system; most scenarios,
I believe, are highlighting results on a doc-by-doc basis or relevance
feedback/"more like this". Think of Term Vectors as an after-the-fact
addition to Lucene for convenience purposes, not for scoring/performance.


> [...] So, generally, if we can update lengthNorm(d) inside the
> indexing code, or via post-processing after all documents are
> indexed, Lucene can implement an arbitrary ranking function such as
> BM25. A pity is that it does not directly support query term weights
> in its query language.

We like patches and will at least consider and discuss most any patch
that is well thought out, backward compatible and tested and where
the author makes a good case for it.

> [...] Unfortunately, I cannot find any low-level code that seems
> related to this implementation strategy. This makes me think the
> Scorer.score method is not the real place where Lucene implements its
> ranking. I am pretty confused about this part. Who can help me with
> this?

You might take a look at how Query/Weight/Scorer triples are implemented.
For instance, I just added a BoostingTermQuery to the Java version that
implements this trifecta.

> [...] Under this framework (which I believe is Lucene's current
> implementation), what we discussed in (4) does not hold. We would
> need to extend Lucene's tf(float freq) function to tf(float freq,
> float norm) so that arbitrary ranking could be implemented.

There is some discussion of flexible indexing approaches on Lucene
Java, which also, to me anyway, implies flexible scoring. You might
search that archive for "flexible indexing" to see if anything
strikes you.


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ