Mailing List Archive

index compression schemes
I was thinking about how one makes good use of multiple cores, and
came across an interesting paper comparing inverted index compression
schemes:
http://www2008.org/papers/pdf/p387-zhangA.pdf
I'm not up on this, but I hadn't heard at all about the PForDelta scheme
they extol. You've probably seen it, but if not, it might be worth a
skim.

--nate

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: index compression schemes
On Jun 30, 2008, at 12:52 PM, Nathan Kurz wrote:

> I was thinking about how one makes good use of multiple cores, and
> came across an interesting paper comparing inverted index compression
> schemes:
> http://www2008.org/papers/pdf/p387-zhangA.pdf

Good stuff! I hadn't read this particular paper, but the index
structures described at the front closely resemble those in KS, so it's
highly relevant.

> I'm not up on this, but hadn't heard at all about the PForDelta scheme
> they extoll. You've probably seen it

I was not familiar with the specifics of the PForDelta scheme.
However, I have a good understanding of the problem that it is
designed to solve (compression of posting data).

The main thing to take away from this paper is the simple fact that
*competing formats exist*. Our goal must be to devise a robust and
efficient plugin format, one that would allow us to use PForDelta
compression, what the paper calls "VByte" (basically what KS uses now)
or any of the other coding strategies.
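
For anyone who hasn't looked at VByte up close, here's roughly what
delta-plus-variable-byte coding of a posting list looks like -- an
untested sketch of the general technique, not KS's actual on-disk
format:

    /* Untested sketch of delta + variable-byte ("VByte") coding for a
     * sorted posting list.  Illustrative only, not KS's on-disk format. */
    #include <stdint.h>
    #include <stdio.h>

    /* Encode one integer, 7 bits per byte, high bit set on all but the
     * last byte. */
    static size_t
    vbyte_encode(uint32_t value, uint8_t *out) {
        size_t len = 0;
        while (value >= 0x80) {
            out[len++] = (uint8_t)(value & 0x7F) | 0x80;
            value >>= 7;
        }
        out[len++] = (uint8_t)value;
        return len;
    }

    int
    main(void) {
        uint32_t doc_nums[] = { 5, 12, 13, 200, 1025 };
        uint8_t  buf[5 * 5];   /* worst case: 5 bytes per 32-bit value */
        size_t   num_bytes = 0;
        uint32_t last = 0;

        /* Store d-gaps (differences between successive doc nums) rather
         * than raw doc nums, so the values stay small and VByte short. */
        for (size_t i = 0; i < 5; i++) {
            num_bytes += vbyte_encode(doc_nums[i] - last, buf + num_bytes);
            last = doc_nums[i];
        }
        printf("5 doc nums -> %zu bytes\n", num_bytes);
        return 0;
    }

Each d-gap under 128 costs a single byte, which is what keeps the
scheme so simple.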

The KinoSearch::Posting abstract class currently encapsulates our
plugin format, but it has some limitations. My original plan was to
have a one-to-one relationship between Posting subclasses and index
formats, but that turns out to be insufficient, and the PForDelta algo
shows why.

PForDelta is only appropriate for batch processing:

    Some of the methods, in particular PForDelta and our implementation
    of Rice Coding, are not suitable for very short lists, as they
    operate on multiples of 32 or 128 numbers. To address this issue,
    we always used variable-byte coding for any inverted list with
    fewer than 100 postings, while for longer lists we padded the
    inverted lists with dummy entries as needed by the various
    compression methods.
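
To make the block orientation concrete, here's a rough sketch of the
per-block decision at the core of a PFor-style coder: pick a bit width
that covers most of the 128 d-gaps, and patch the overflowing values in
from a separate exception area. The 90% threshold and the 32-bit
exception slots are invented for illustration; the actual format in the
paper differs in its details.

    /* Rough, untested sketch of how a PFor-style coder might size one
     * block: choose a bit width b that covers most of the 128 d-gaps,
     * and treat the rest as exceptions to be patched in separately. */
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 128

    /* Number of bits needed to represent v. */
    static int
    bit_width(uint32_t v) {
        int b = 1;
        while (b < 32 && (v >> b)) { b++; }
        return b;
    }

    /* Find the smallest b where at most 10% of the block overflows b
     * bits, then report the block's cost: BLOCK_SIZE * b bits for the
     * regular slots, plus a full 32 bits per exception. */
    static void
    plan_block(const uint32_t *gaps, int n) {
        for (int b = 1; b <= 32; b++) {
            int exceptions = 0;
            for (int i = 0; i < n; i++) {
                if (bit_width(gaps[i]) > b) { exceptions++; }
            }
            if (exceptions * 10 <= n || b == 32) {
                printf("b = %d bits, %d exceptions, ~%d bits total\n",
                       b, exceptions, n * b + exceptions * 32);
                return;
            }
        }
    }

    int
    main(void) {
        uint32_t gaps[BLOCK_SIZE];
        /* Mostly tiny gaps, with an occasional large outlier. */
        for (int i = 0; i < BLOCK_SIZE; i++) {
            gaps[i] = (i % 16 == 0) ? 5000 : (uint32_t)(i % 7) + 1;
        }
        plan_block(gaps, BLOCK_SIZE);
        return 0;
    }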

However, Posting was not designed to maintain state well enough to
batch process -- the writing method just took several arguments
describing the last posting and the current posting in the loop.
We're really going to need a dedicated PostingEncoder class to handle
something like that. And then probably we will need a dedicated
PostingDecoder class as well for search-time.
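
Sketching that out, I'm picturing something along these lines -- none
of these names exist in KS today, and this is untested:

    /* Hypothetical shape for a stateful PostingEncoder: it buffers doc
     * nums so a block codec can see 128 postings at a time, and falls
     * back to a byte-at-a-time codec for the leftover tail. */
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 128

    typedef struct {
        uint32_t buffer[BLOCK_SIZE];  /* pending d-gaps            */
        int      count;               /* how many gaps are pending */
        uint32_t last_doc_num;        /* for delta coding          */
    } PostingEncoder;

    /* Stand-ins for the real codecs. */
    static void
    encode_block(const uint32_t *gaps, int n) { (void)gaps; (void)n; }
    static void
    encode_vbyte(const uint32_t *gaps, int n) { (void)gaps; (void)n; }

    static void
    PostEnc_init(PostingEncoder *self) {
        memset(self, 0, sizeof(*self));
    }

    /* Accept one posting; flush a full block once 128 have accumulated.
     * The per-call interface stays simple, but the encoder holds enough
     * state to batch-process. */
    static void
    PostEnc_add(PostingEncoder *self, uint32_t doc_num) {
        self->buffer[self->count++] = doc_num - self->last_doc_num;
        self->last_doc_num = doc_num;
        if (self->count == BLOCK_SIZE) {
            encode_block(self->buffer, BLOCK_SIZE);
            self->count = 0;
        }
    }

    /* Flush the partial block at the end of a term's list with a
     * byte-oriented fallback, echoing the paper's short-list policy. */
    static void
    PostEnc_finish(PostingEncoder *self) {
        if (self->count > 0) {
            encode_vbyte(self->buffer, self->count);
            self->count = 0;
        }
    }

    int
    main(void) {
        PostingEncoder enc;
        PostEnc_init(&enc);
        for (uint32_t doc = 1; doc <= 300; doc++) {
            PostEnc_add(&enc, doc * 3);
        }
        PostEnc_finish(&enc);   /* flushes the 44-posting tail */
        return 0;
    }

The point is that the encoder, rather than the Posting object, owns the
buffering, so a block codec can be swapped in behind PostEnc_add()
without the calling loop knowing.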

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: index compression schemes
On Mon, Jun 30, 2008 at 9:22 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> The main thing to take away from this paper is the simple fact that
> *competing formats exist*. Our goal must be to devise a robust and
> efficient plugin format, one that would allow us to use PForDelta
> compression, what the paper calls "VByte" (basically what KS uses now) or
> any of the other coding strategies.

Yes, that was the point I took home as well. I was a little surprised
by the relatively poor performance of the variable byte encoding,
though, or conversely by the very good performance of some of the
block methods. This probably means I need to be more aware of branch
prediction when thinking about optimization.
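
My guess at the mechanism, in sketch form: VByte decoding has to test
the continuation bit of every input byte -- a data-dependent branch the
CPU keeps mispredicting -- while unpacking a fixed-width block is a
straight-line loop it can keep fed. This uses the generic
7-bits-per-byte flavor of VByte and is untested:

    /* Untested contrast between a branchy per-byte decoder and a
     * branch-free fixed-width block decoder. */
    #include <stdint.h>
    #include <stdio.h>

    /* Branchy: one hard-to-predict test per input byte. */
    static size_t
    vbyte_decode(const uint8_t *in, uint32_t *value) {
        uint32_t v     = 0;
        size_t   i     = 0;
        int      shift = 0;
        while (in[i] & 0x80) {                 /* data-dependent branch */
            v |= (uint32_t)(in[i++] & 0x7F) << shift;
            shift += 7;
        }
        *value = v | ((uint32_t)in[i++] << shift);
        return i;
    }

    /* Branch-free inner loop: every value in this block is exactly
     * 16 bits wide, so the decoder never consults the data to decide
     * what to do next. */
    static void
    decode_fixed16_block(const uint8_t *in, uint32_t *out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = (uint32_t)in[2 * i] | ((uint32_t)in[2 * i + 1] << 8);
        }
    }

    int
    main(void) {
        const uint8_t vb[]    = { 0xB9, 0x06 };             /* 825    */
        const uint8_t block[] = { 0x39, 0x03, 0x05, 0x00 }; /* 825, 5 */
        uint32_t v, out[2];
        vbyte_decode(vb, &v);
        decode_fixed16_block(block, out, 2);
        printf("%u %u %u\n", v, out[0], out[1]);  /* "825 825 5" */
        return 0;
    }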

> The KinoSearch::Posting abstract class currently encapsulates our plugin
> format, but it has some limitations. My original plan was to have a
> one-to-one relationship between Posting subclasses and index formats, but
> that turns out to be insufficient, and the PForDelta algo shows why.

Yes, although I'd question the word 'insufficient', as it might be
taken to imply we need even more Posting classes to encompass multiway
relationships. But I agree that the Posting class requires special
thought as to how it will be extended to allow for smooth interaction
between custom scorers and custom index formats.

The goal here, in my mind, is to make it possible to write a custom
index format that works with all existing scorers for which the index
holds the relevant data. Conversely, it should also be possible to
write a custom scorer that makes use of existing indexes without
having to modify or subclass them.

> However, Posting was not designed to maintain state well enough to batch
> process -- the writing method just took several arguments describing the
> last posting and the current posting in the loop. We're really going to
> need a dedicated PostingEncoder class to handle something like that. And
> then probably we will need a dedicated PostingDecoder class as well for
> search-time.

This might provide good generality, but I think I prefer a more
minimalist solution, with the Posting class acting as a passive
container that is filled by the Index and scored by the Scorer. The
Scorer chooses the class, since it presumably needs all the data to
score with, and passes it to the Index, which fills in the data fields.
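
In code, I'm imagining something like the following -- invented names,
not real KS classes, and untested:

    /* Untested sketch of the "passive container" arrangement.  The
     * Scorer picks the Posting flavor it needs; the Index's posting
     * list fills in the fields on each advance and keeps the codec
     * to itself. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t doc_num;
        uint32_t freq;   /* an index without freq data would skip this */
    } ScorePosting;

    /* Index side: block decompression (PFor, VByte, whatever) lives
     * behind this interface and nowhere else. */
    typedef struct PostingList PostingList;
    struct PostingList {
        /* Fill `post` with the next posting; return 0 at end of list. */
        int (*next)(PostingList *self, ScorePosting *post);
    };

    /* Scorer side: consumes filled-in Postings, ignorant of the codec. */
    static float
    score_next(PostingList *plist, ScorePosting *post, float idf) {
        if (!plist->next(plist, post)) { return -1.0f; }
        return idf * (float)post->freq;   /* toy tf * idf weight */
    }

    /* Toy "index" backed by parallel arrays, standing in for a real
     * segment reader that would decode compressed blocks here. */
    typedef struct {
        PostingList     base;
        const uint32_t *doc_nums;
        const uint32_t *freqs;
        int             size, pos;
    } ArrayPostingList;

    static int
    ArrayPList_next(PostingList *self, ScorePosting *post) {
        ArrayPostingList *ap = (ArrayPostingList*)self;
        if (ap->pos >= ap->size) { return 0; }
        post->doc_num = ap->doc_nums[ap->pos];
        post->freq    = ap->freqs[ap->pos];
        ap->pos++;
        return 1;
    }

    int
    main(void) {
        const uint32_t doc_nums[] = { 2, 5, 9 };
        const uint32_t freqs[]    = { 1, 4, 2 };
        ArrayPostingList plist = {
            { ArrayPList_next }, doc_nums, freqs, 3, 0
        };
        ScorePosting post;
        float s;
        while ((s = score_next(&plist.base, &post, 1.5f)) >= 0.0f) {
            printf("doc %u scores %.2f\n", post.doc_num, s);
        }
        return 0;
    }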

The Index would need some custom logic for parent classes of the
fullest Posting class it can handle, but I think this would be
straightforward. Block compression and the like would all happen
within the Index, and the Posting classes and Scorers would remain
blissfully ignorant. I haven't thought it through, but I think the
same benefits would extend to Indexers.

Nathan Kurz
nate@verse.com
