On Mon, Jun 30, 2008 at 9:22 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> The main thing to take away from this paper is the simple fact that
> *competing formats exist*. Our goal must be to devise a robust and
> efficient plugin format, one that would allow us to use PForDelta
> compression, what the paper calls "VByte" (basically what KS uses now) or
> any of the other coding strategies.
Yes, that was the point I took home as well. I was a little surprised
by the relatively poor performance of the variable byte encoding,
though, or conversely by the very good performance of some of the
block methods. This probably means I need to be more aware of branch
prediction when thinking about optimization.
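For reference, here is a minimal sketch of variable byte coding in Python. It shows where the per-byte, data-dependent branch comes from. The function names are hypothetical, and the byte layout is an assumption on my part (7 data bits per byte, low-order group first, high bit set when more bytes follow); it is not taken from the paper or from the KS file format.

```python
def vbyte_encode(numbers):
    """Encode non-negative ints using 7 data bits per byte.

    Assumed layout: low-order 7-bit group first; the high bit of a byte
    is set when more bytes follow and clear on the final byte.
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("vbyte_encode only handles non-negative ints")
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)  # continuation bit set
            n >>= 7
        out.append(n)  # final byte, continuation bit clear
    return bytes(out)


def vbyte_decode(data):
    """Decode bytes produced by vbyte_encode back into a list of ints."""
    numbers = []
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            # This test depends on the data itself and runs once per
            # byte, which is the kind of branch a CPU may mispredict.
            shift += 7
        else:
            numbers.append(value)
            value = 0
            shift = 0
    return numbers
```

A block method, by contrast, decodes a fixed number of values at a fixed bit width in a loop whose control flow does not depend on the data, which is presumably where the difference in branch behavior comes from.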
> The KinoSearch::Posting abstract class currently encapsulates our plugin
> format, but it has some limitations. My original plan was to have a
> one-to-one relationship between Posting subclasses and index formats, but
> that turns out to be insufficient, and the PForDelta algo shows why.
Yes, although I might question the word 'insufficient', as it might be
taken to imply we need even more Posting classes to encompass multiway
relationships. But I agree that the Posting class needs careful
thought about how it will be extended to allow for smooth interaction
between custom scorers and custom index formats.
The goal here, in my mind, is to make it possible to write a custom
index format that works with all existing scorers for which the index
holds the relevant data. Conversely, it should also be possible to
write a custom scorer that makes use of existing indexes without
having to modify or subclass those indexes.
> However, Posting was not designed to maintain state well enough to batch
> process -- the writing method just took several arguments describing the
> last posting and the current posting in the loop. We're really going to
> need a dedicated PostingEncoder class to handle something like that. And
> then probably we will need a dedicated PostingDecoder class as well for
> search-time.
This might provide good generality, but I think I prefer a more
minimalist solution, with the Posting class acting as a passive
container that is filled by the Index and scored by the Scorer. The
Scorer chooses the class, since it knows which data it needs to score
with, and passes it to the Index, which fills in the data fields.
The Index would need some custom logic to fill the parent classes of
the fullest Posting class it can handle, but I think this would be
straightforward. Block compression and the like would all happen
within the Index, and the Posting classes and Scorers would remain
blissfully ignorant. I haven't thought it through, but I think the same
benefits would extend to Indexers.
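To make the proposal concrete, here is a rough Python sketch of the arrangement I have in mind. Every name in it (DocPosting, FreqPosting, PositionPosting, ToyIndex, the scorers, and search) is hypothetical and invented for illustration; none of it is the existing KinoSearch API, and the in-memory dict stands in for whatever on-disk format and compression the Index actually uses.

```python
from dataclasses import dataclass, fields


@dataclass
class DocPosting:
    """Smallest container: only the document number."""
    doc_id: int = 0


@dataclass
class FreqPosting(DocPosting):
    """Adds the within-document term frequency."""
    freq: int = 0


@dataclass
class PositionPosting(FreqPosting):
    """Adds term positions, which ToyIndex below does not store."""
    positions: tuple = ()


class ToyIndex:
    """Hypothetical index that stores (doc_id, freq) pairs per term."""

    fullest = FreqPosting  # the richest Posting class this index can fill

    def __init__(self, raw):
        self._raw = raw  # dict: term -> list of (doc_id, freq)

    def iter_postings(self, term, posting_class):
        # The index can fill its fullest class or any parent of it.
        if not (issubclass(posting_class, DocPosting)
                and issubclass(self.fullest, posting_class)):
            raise TypeError("index cannot fill " + posting_class.__name__)
        wanted = [f.name for f in fields(posting_class)]
        for doc_id, freq in self._raw.get(term, ()):
            record = {"doc_id": doc_id, "freq": freq}
            # Decompression, if any, would be hidden in here.
            yield posting_class(**{name: record[name] for name in wanted})


class MatchScorer:
    posting_class = DocPosting  # the Scorer chooses the container

    def score(self, posting):
        return 1.0


class FreqScorer:
    posting_class = FreqPosting

    def score(self, posting):
        return float(posting.freq)


class PositionScorer:
    posting_class = PositionPosting

    def score(self, posting):
        return float(len(posting.positions))


def search(index, scorer, term):
    """Let the index fill the scorer's chosen container, then score it."""
    return [(p.doc_id, scorer.score(p))
            for p in index.iter_postings(term, scorer.posting_class)]
```

With this layout, MatchScorer and FreqScorer both run against ToyIndex unchanged, while PositionScorer is rejected with a TypeError because the index holds no position data. A new index format would only need to declare its fullest class and fill it.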
Nathan Kurz
nate@verse.com