[Private mail => list, as requested]
Marvin Humphrey wrote:
>> I'd like to be able to use kinosearch to generate my tag
>> clouds, which are essentially mappings between all terms in a given
>> field and the number of postings for that term. Is there a way
>> (supported or otherwise) for me to grab this directly from the index
>> data, or do I have to grovel around doing a search for each individual
>> term and counting up the hits?
>
> If all you need is the document frequency for each term in the corpus,
> and it doesn't matter whether some of those docs might be deleted,
> you can use this unsupported method in any version of KS:
>
> my $doc_freq = $searcher->doc_freq($term);
>
> That's nice and fast, because all it does is access the term infos file
> rather than consult the postings files as a search would.
Nice and fast indeed! It's just that...
perl -MGlob -le 'print Glob->searcher->search(query => "tag:theology")->total_hits;
print Glob->searcher->doc_freq(KinoSearch::Index::Term->new("tag", "theology"))'
130
0
What am I doing wrong?
> Another way, also unsupported, is to get a terms iterator (TermEnum in
> Lucene/Plucene), which allows you to access the term infos data
> sequentially. The exact incantation to get one in KS has changed a
> number of times, though, and is in flux again today.
This one would be useful for me eventually for building tag-clouds, but for the
time being I can get away with having a "tag" table in the database and going
through that. Ideally most of my database would move to KinoSearch though.
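For reference, the stopgap amounts to something like the following minimal
sketch. The rows and the tag_counts helper are made up for illustration (the
real table layout isn't shown here); it just builds the tag => number-of-docs
mapping that a tag cloud needs, without touching KinoSearch at all.

```perl
use strict;
use warnings;

# Hypothetical rows from a "tag" table: one [doc_id, tag] pair per tagging.
my @rows = (
    [ 1, 'theology' ],
    [ 2, 'theology' ],
    [ 2, 'perl' ],
    [ 3, 'perl' ],
    [ 3, 'theology' ],
    [ 4, 'search' ],
);

# Count distinct documents per tag. For an index with no deleted docs this
# should be the same number as a per-term document frequency.
sub tag_counts {
    my @pairs = @_;
    my ( %seen, %count );
    for my $pair (@pairs) {
        my ( $doc_id, $tag ) = @$pair;
        next if $seen{$tag}{$doc_id}++;    # ignore duplicate taggings
        $count{$tag}++;
    }
    return \%count;
}

# Print the mapping, most frequent tag first, ties broken alphabetically.
my $counts = tag_counts(@rows);
for my $tag ( sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts ) {
    print "$tag\t$counts->{$tag}\n";
}
```

If doc_freq can be made to return sensible numbers, the loop over rows could
presumably be swapped for one doc_freq call per known tag.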
Simon