Hello,
It seems the program and I are named after the same character from the same
book!
Well, when something is named after you, it just begs to be used... especially
when you have a use for it.
My problem is that I want an inverted index of the Google N-gram data:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
The following is an example of the 4-gram data in this corpus:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
That's over 3,793 million documents, the longest of which are five words plus
a frequency count. Not your normal indexing problem!
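For context, each corpus line can be treated as a tiny document: the n-gram text plus a frequency field. A minimal sketch of reading the gzip'ed files that way, in Python for illustration (KinoSearch itself is driven from Perl; the field names `ngram` and `freq` are my own, not from the corpus):

```python
import gzip

def ngram_docs(path):
    """Yield one 'document' per corpus line: the n-gram text and its count."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # The count is the last field; everything before it is the n-gram.
            *words, count = fields
            yield {"ngram": " ".join(words), "freq": int(count)}
```

Each yielded dict would then be handed to the indexer as one document.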
I've worked on the unigrams and bigrams, which insert just fine, followed by
finish().
But when I run a query, Perl gets very upset and segfaults. Not good.
I can run the query just fine while the index is still in the "S"'s, but when
I add the rest and finish: segfault.
I thought that some of the non-keyboard-entry characters were affecting it,
but even after filtering them out: segfault.
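For what it's worth, the kind of filtering I mean could be sketched like this (Python for illustration; the exact byte ranges to reject are my guess at what "non-keyboard" characters might mean):

```python
def is_clean(line):
    """True if the line contains only printable ASCII plus tab/newline,
    i.e. none of the suspect 'non-keyboard' bytes."""
    return all(c in ("\t", "\n") or 32 <= ord(c) < 127 for c in line)
```

Lines failing this check were simply dropped before insertion, and the segfault still occurred.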
One unusual thing that did happen during the indexing was a power failure.
However, the insert and search processes seem to work; only when the bigram
indexing process finishes does it cause problems.
And of course there are still the trigrams, fourgrams and fivegrams.
Do you have any ideas, or do you know whether this is impossible to do with a
single index?
Maybe a job for 0.20?
It seems to be doing a good job until it finishes.
Also, one other option I am looking at is building the indexes in parallel
and merging them into a unified index. I know it's possible, but will it be
happy with the sizes I have to deal with?
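The intuition behind the parallel option: if each shard's entries come out sorted, unifying them is essentially a k-way merge of sorted runs. A language-agnostic sketch using Python's `heapq.merge` (the flat-file shard layout here is hypothetical, not KinoSearch's actual on-disk format):

```python
import heapq

def merge_shards(shard_paths, out_path):
    """K-way merge of individually sorted shard files into one sorted file."""
    files = [open(p, "r", encoding="utf-8") for p in shard_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            # heapq.merge streams its inputs line by line, so memory use
            # stays flat no matter how large the shards are.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
```

Whether a real index merge at these sizes behaves as gracefully is exactly the question.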
And in general are there recommended and absolute limits on the index size?
If it can handle this then the few extra million semantic records I want to
mix in should be easy.
Bests,
Kino Coursey (the other Kino)
Daxtron Labs
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.rectangular.com/pipermail/kinosearch/attachments/20070308/ba2ac360/attachment.html