Mailing List Archive: Using Kinosearch file format as generic inverted index

Simon,

I don't know if this is realistic right now. The problem is that
KinoSearch, like Lucene, is tightly bound to its file format. It's
possible, but it would take a fairly deep understanding of KS's data
structures and an awful lot of hacking on stuff that isn't public.

Nevertheless, this is *precisely* the direction that I want to take
KinoSearch, and Lucy.

Document and Field should both be abstract classes that specify
serialization/deserialization methods.
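
A minimal sketch of what "abstract class that specifies serialization/deserialization methods" could look like, written in Python for illustration only. The names `Field`, `TextField`, `serialize`, and `deserialize`, and the byte layout, are assumptions made for this example; they are not the KinoSearch or Lucy API.

```python
import struct
from abc import ABC, abstractmethod


class Field(ABC):
    """Hypothetical abstract field: each subclass owns its own byte layout."""

    @abstractmethod
    def serialize(self) -> bytes:
        """Return the bytes that would be written to the index."""

    @classmethod
    @abstractmethod
    def deserialize(cls, data: bytes) -> "Field":
        """Rebuild a field from bytes produced by serialize()."""


class TextField(Field):
    """One concrete field type; its layout is an assumption for this sketch."""

    def __init__(self, name: str, value: str):
        self.name = name
        self.value = value

    def serialize(self) -> bytes:
        name = self.name.encode("utf-8")
        value = self.value.encode("utf-8")
        # Two big-endian uint32 lengths, followed by the raw UTF-8 bytes.
        return struct.pack(">II", len(name), len(value)) + name + value

    @classmethod
    def deserialize(cls, data: bytes) -> "TextField":
        name_len, value_len = struct.unpack_from(">II", data, 0)
        start = struct.calcsize(">II")
        name = data[start:start + name_len].decode("utf-8")
        value = data[start + name_len:start + name_len + value_len].decode("utf-8")
        return cls(name, value)
```

The point of the design is that code reading the index only ever calls `deserialize` and never needs to know the layout, so a different subclass can swap in a different format.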

In the near term, that change is crucial for liberating KinoSearch
from its file format. Once the logic for reading the index lives in
a plugin, and fewer classes spec how to read the index directly,
backwards compatibility suddenly gets a hell of a lot easier... and
we can get rid of that dang "alpha" label.

In the longer term, I hope to enable innovation along the lines of
what you propose to do. Other examples include...

* "boost per-position", allowing, say, text between h1
tags to contribute more than text between p tags.
* tracking part-of-speech per-position
* Associating each term with LSA vectors
* ????? -- a generic inverted index will hopefully be put
to uses that are not currently envisioned.

The plan is to battle-test the abstraction privately first using a
new file format which will fit with this scheme more comfortably than
the current one. The target release for the private API is 0.20.
Once we cross that threshold, it will be easier to do what you
propose, if you're willing to live on the bleeding edge and hack away
at the internals.

> Related to my previous post and some algorithms I've been playing
> around with, I'd like to try and see if I can get a performance boost
> out of using the KinoSearch InvIndex to store some graph data.
>
> I need to store a node id and then a list of other node ids that it
> links to. The edge needs to have 2 other arbitrary fields attached
> to it - a type and value (although I suppose the type could be done by
> having each different type in different indexes). Preferably each node
> should be able to be looked up as an id or as a value.

Does it need to be per-position or per-term? This can get fairly
expensive if you need it per-position. Think of whether each word in
a book's index needs the tagging, or whether each page number within
each index entry needs the tagging. If it's each page number, then
you need a lot more space than if it's per-term.

> Understandably the docs don't really go into how to do this - the
> various classes seem a bit ... sparse on POD :)

Sparse on visible POD at least for private classes -- by design.
However, have you snooped the actual module code rather than just
running it through perldoc or looking on search.cpan.org? In some
cases, there's fairly extensive documentation hidden away -- see
OutStream for a good example.

> Any idea on whether this is a sane thing to do and, if so, how to go
> about doing it?

The main classes you would need to be concerned with at search-time
are...

* TermEnum/SegTermEnum -- an "array" of Terms.
* TermDocs/SegTermDocs/MultiTermDocs -- for each term, an "array"
of doc numbers and other info.
* TermBuffer -- does the deserialization for SegTermEnum.

TermEnum and TermDocs aren't really arrays, they're iterators, but
it's easier if we think of them as giant arrays.
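
The "iterator that you can think of as an array" idea can be sketched with a toy in-memory index in Python. The classes `ToyTermEnum` and `ToyTermDocs` and their methods are hypothetical stand-ins, loosely shaped like Lucene-style enumerators; the real KinoSearch classes read from disk and their methods may differ.

```python
class ToyTermEnum:
    """Walks the terms in sorted order, one at a time."""

    def __init__(self, postings):
        self._terms = sorted(postings)
        self._pos = -1

    def next(self):
        """Advance to the next term; return False when exhausted."""
        self._pos += 1
        return self._pos < len(self._terms)

    def term(self):
        return self._terms[self._pos]


class ToyTermDocs:
    """Walks the (doc number, frequency) pairs recorded for one term."""

    def __init__(self, postings, term):
        self._docs = postings.get(term, [])
        self._pos = -1

    def next(self):
        self._pos += 1
        return self._pos < len(self._docs)

    def doc(self):
        return self._docs[self._pos][0]

    def freq(self):
        return self._docs[self._pos][1]


def dump(postings):
    """Flatten the index into (term, doc, freq) rows using only the iterators."""
    rows = []
    terms = ToyTermEnum(postings)
    while terms.next():
        docs = ToyTermDocs(postings, terms.term())
        while docs.next():
            rows.append((terms.term(), docs.doc(), docs.freq()))
    return rows


# Hypothetical data: term -> list of (doc number, frequency in that doc).
postings = {"lucy": [(1, 1)], "kino": [(0, 2), (3, 1)]}
print(dump(postings))
```

Because callers only use `next()` and the accessors, the data never has to be loaded into memory as a real array.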

At index-time, it's harder to describe what's going on, but for the
record, the classes that handle the low level writing are
PostingsWriter, TermInfosWriter, and SegWriter.

The idea is that you would stuff an extra number somewhere into
the file format, then recover it later and probably make use of it in
a specialized scorer. I haven't thought too much about the API.
Maybe something pack-ish? We'll see.
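
As a rough illustration of what "something pack-ish" might mean for the graph use case, here is a Python sketch using the standard `struct` module. The record layout (target node id, edge type, edge value) and the helper names `write_postings` and `read_postings` are assumptions for this example only; no such API exists in KinoSearch.

```python
import struct

# Assumed fixed-width record per edge, big-endian, no padding:
#   uint32 target node id, uint16 edge type, float64 edge value.
EDGE = struct.Struct(">IHd")


def write_postings(edges):
    """Pack a list of (target, type, value) tuples into one byte string."""
    return b"".join(EDGE.pack(target, etype, value) for target, etype, value in edges)


def read_postings(blob):
    """Recover the (target, type, value) tuples from a packed byte string."""
    return list(EDGE.iter_unpack(blob))


# Toy "inverted index": source node id -> packed list of outgoing edges.
graph = {
    1: write_postings([(2, 0, 0.5), (3, 1, 2.0)]),
    2: write_postings([(3, 0, 1.0)]),
}

for target, etype, value in read_postings(graph[1]):
    print(target, etype, value)
```

A specialized scorer would then read the extra numbers back out at search time instead of printing them.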

For background, see the POD in KinoSearch::Docs::FileFormat and
<http://wiki.apache.org/jakarta-lucene/FlexibleIndexing>.

Cheers,

Marvin Humphrey
