Mailing List Archive: Pruning (was PolyFilter and Plans)

On Apr 2, 2007, at 10:00 AM, Mike Wexler wrote:

> What is this pruning.

It's a heuristic optimization that can be applied when...

1) You have a way of establishing an absolute rank for all
documents -- e.g. page score, date of publication, price.
2) That primary ranking heavily influences which documents you
want returned.

If your data doesn't lend itself to that kind of primary ranking
technique, pruning is of no use to you.

The main application is for large search engines that use something
akin to Google's (patented) pagerank algorithm.

At index time, you sort documents so that the lowest document numbers
have the highest rank.

At search-time, you "prune" the result set by stopping the hit-
collection process early. The idea is that if you are looking for
100 hits, after you've seen 1000 or so you've probably seen most or
all of the good ones, so you don't need to process e.g. the other
5,000,000.

With a segment-based index like that used in Lucene and KS, you have
to process each segment, but even if you have 25 segments, that's
only 25,000 docs which must be processed completely. The rest must
be "skipped", which isn't free, but is still cheap compared to full
scoring.

The idea is to pay out a small price in relevance in return for a
large performance benefit.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/