Mailing List Archive

Kino uses KinoSearch with Google n-grams...
Hello,



Seems the program and I are named after the same character from the same
book!

Well, when something is named after you, it just begs to be used... especially
when you have a use for it.



My problem is I want an inverted index of the Google N-gram data
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

The following is an example of the 4-gram data in this corpus:

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223

File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584

Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663



It's the 3,793+ million documents, the longest of which are five words plus a
frequency count. Not your normal indexing problem!

I've worked on the unigrams and bigrams, which are inserted just fine, and
finish() completes.
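
For reference, the indexing code is just the stock InvIndexer recipe, roughly
the sketch below (the path and field names are placeholders, and the real data
of course comes out of the gzip'ed files):

    use KinoSearch::InvIndexer;
    use KinoSearch::Analysis::Tokenizer;

    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/path/to/ngram_invindex',    # placeholder path
        create   => 1,
        analyzer => KinoSearch::Analysis::Tokenizer->new,
    );
    $invindexer->spec_field( name => 'ngram' );
    $invindexer->spec_field( name => 'count' );

    while ( my $line = <> ) {
        chomp $line;
        # each input line is an n-gram followed by its frequency count,
        # e.g. "serve as the index 223"
        my ( $ngram, $count ) = $line =~ /^(.*)\s+(\d+)$/;
        next unless defined $count;
        $invindexer->add_doc( { ngram => $ngram, count => $count } );
    }
    $invindexer->finish;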

But when I run a query, perl gets very upset and seg faults. Not good.
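
The query code is nothing fancy either; roughly this (same placeholder path
and field name as above):

    use KinoSearch::Searcher;
    use KinoSearch::Analysis::Tokenizer;

    my $searcher = KinoSearch::Searcher->new(
        invindex => '/path/to/ngram_invindex',    # placeholder path
        analyzer => KinoSearch::Analysis::Tokenizer->new,
    );
    my $hits = $searcher->search( query => 'serve as the index' );
    $hits->seek( 0, 10 );
    while ( my $hit = $hits->fetch_hit_hashref ) {
        print "$hit->{ngram}\n";
    }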

I can run the query just fine when the index is in the "S"'s, but when I add
the rest and finish: seg fault.

I thought that some of the non-keyboard-entry characters were affecting it,
but even after filtering them out: seg fault.

One unusual thing that did happen during the indexing was a power failure.
However, the insert and search processes seem to work, and only when the
bigram indexing process finishes does it cause problems.



And of course there are still the trigrams, fourgrams and fivegrams.



Do you have any ideas, or do you know whether this is impossible to do with a
single index?

Maybe a job for 0.20?

It seems to be doing a good job until it finishes.

Another option I am looking at is building the indexes in parallel and
merging them into a unified index. I know it's possible, but will it be
happy with the sizes I have to deal with?

And in general are there recommended and absolute limits on the index size?



If it can handle this, then the few extra million semantic records I want to
mix in should be easy.



Bests,

Kino Coursey (the other Kino)

Daxtron Labs





Kino uses KinoSearch with Google n-grams... [ In reply to ]
On Mar 8, 2007, at 7:47 PM, Kino Coursey wrote:
> Seems the program and I are named after the same character from the
> same book!
Guess I'm not the only one it made a deep impression upon. :)
> It's the 3,793+ million documents
That number of small documents shouldn't pose a problem. The number
of terms may be where we're running up against something. In theory,
the maximum number of terms per-index should be somewhere just shy of
2**31, as they are tracked using a signed 32-bit integer. However,
there may be bottlenecks somewhere else I didn't think about. Maybe
there's some term-number arithmetic that wraps somewhere.

The way to hunt this down is to design an algorithm specifically for
maxing out unique terms and see where it chokes.

[ ... investigates ... ]

Found one bottleneck. The loop iteration variable in
PostingsWriter's big finishing loop is a 32-bit integer. It
definitely ought to be a 64-bit integer, because it increments once
for each posting list (one term, one doc, multiple positions). That
really needs to get fixed; however, it ought to result in the
exclusion of high sorting terms (higher field number and term text
closer to 'z'), rather than cause a segfault.
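
Just to make the failure mode concrete, here's a simulation of the arithmetic.
This is not the actual PostingsWriter code, only an illustration of what a
signed 32-bit slot does to a count that grows past 2**31:

    # Illustration only: Perl's own integers won't wrap like this, so the
    # pack/unpack round-trip forces the value through a signed 32-bit slot
    # the way a C int32 would store it.
    my $max_i32 = 2**31 - 1;    # 2,147,483,647
    for my $count ( $max_i32, $max_i32 + 1 ) {
        my $as_i32 = unpack( 'l', pack( 'l', $count ) );
        print "$count stored in 32 bits comes back as $as_i32\n";
    }
    # 2147483647 stored in 32 bits comes back as 2147483647
    # 2147483648 stored in 32 bits comes back as -2147483648
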
> I can run the query just fine when the index is in the "S"'s, but
> when I add the rest and finish: seg fault.
If you are on Linux and can spare the cycles, it would be interesting
to see what Valgrind has to say about this seg fault.
> One unusual thing that did happen during the indexing was a power
> failure.
I doubt that affected things. KinoSearch's indexing is robust in the
face of crashes. There's a moment when new data is committed via the
renaming of a file; if the indexing process stops before that,
there's no change.
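
Schematically it's the classic write-then-rename trick, something like the
sketch below (file names invented for the example, not the real index file
names):

    # write the new data somewhere readers never look ...
    my $commit_data = "...";    # stand-in for the real index metadata
    open( my $fh, '>', 'commit.temp' ) or die "open: $!";
    print {$fh} $commit_data;
    close $fh or die "close: $!";

    # ... then make it visible in one atomic step.  If the process dies
    # anywhere before this rename, the old file is untouched and the
    # index is exactly as it was before indexing started.
    rename( 'commit.temp', 'commit' ) or die "rename: $!";
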
> Maybe a job for 0.20?
I would like to get this sorted before the official release of 0.20.
If, for some reason, accommodating a large number of terms (assuming
that is the issue) requires a backwards-incompatible change, I'd like
to bundle that change with all the others. I doubt that will be the
case, though. The architecture is derived from Lucene's, which has
been used to handle indexes in excess of 100 million documents (190
million is the largest I recall having heard about).

> Another option I am looking at is building the indexes in parallel
> and merging them into a unified index. I know it's possible, but
> will it be happy with the sizes I have to deal with?
>
> And in general are there recommended and absolute limits on the
> index size?
The architecture ought to withstand several million or possibly tens
of millions of docs on a single machine, depending on document size
and required response time. After that, it will be necessary to
spread out the index over multiple machines and combine search
results using MultiSearcher and SearchServer/SearchClient.
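
Roughly, that spread-out setup looks like the sketch below -- I'm writing the
parameter names from memory, so treat them as approximate, and the host names
and password are of course just examples:

    use KinoSearch::Search::MultiSearcher;
    use KinoSearch::Search::SearchClient;
    use KinoSearch::Analysis::Tokenizer;

    my $analyzer = KinoSearch::Analysis::Tokenizer->new;

    # one SearchClient per machine, each talking to a SearchServer
    # process that holds a slice of the index
    my @clients;
    for my $address ( 'shard1:7890', 'shard2:7890' ) {
        push @clients, KinoSearch::Search::SearchClient->new(
            peer_address => $address,
            password     => 'opensesame',    # shared secret, example only
            analyzer     => $analyzer,
        );
    }

    my $multi_searcher = KinoSearch::Search::MultiSearcher->new(
        searchables => \@clients,
        analyzer    => $analyzer,
    );
    my $hits = $multi_searcher->search( query => 'serve as the index' );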

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/