Mailing List Archive

Re: KinoSearch postings database
[Private mail => list, as requested]

Marvin Humphrey wrote:
>> I'd like to be able to use kinosearch to generate my tag
>> clouds, which are essentially mappings between all terms in a given
>> field and the number of postings for that term. Is there a way
>> (supported or otherwise) for me to grab this directly from the index
>> data, or do I have to grovel around doing a search for each individual
>> term and counting up the hits?
>
> If all you need is the document frequency for each term in the corpus,
> and it doesn't matter whether some of those docs might be deleted,
> you can use this unsupported method in any version of KS:
>
> my $doc_freq = $searcher->doc_freq($term);
>
> That's nice and fast, because all it does is access the term infos file
> rather than consult the postings files as a search would.

Nice and fast indeed! It's just that...

perl -MGlob -le 'print Glob->searcher->search(query => "tag:theology")->total_hits;
print Glob->searcher->doc_freq(KinoSearch::Index::Term->new("tag", "theology"))'
130
0

What am I doing wrong?

> Another way, also unsupported, is to get a terms iterator (TermEnum in
> Lucene/Plucene), which allows you to access the term infos data
> sequentially. The exact incantation to get one in KS has changed a
> number of times, though, and is in flux again today.

This one would be useful for me eventually for building tag-clouds, but for the
time being I can get away with having a "tag" table in the database and going
through that. Ideally most of my database would move to KinoSearch though.
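
For reference, the mapping a tag cloud needs (each term in a field, paired with the number of documents containing it) can be sketched in plain Perl over in-memory data. This is only an illustration of the data shape, not KinoSearch code: it makes no KinoSearch calls, and %tags_for_doc and its contents are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data: document id => list of tags on that document.
my %tags_for_doc = (
    1 => [qw(theology perl)],
    2 => [qw(theology)],
    3 => [qw(perl search)],
);

# For each tag, count how many documents carry it (document frequency).
my %doc_freq;
for my $doc_id ( keys %tags_for_doc ) {
    my %seen;    # guards against counting a repeated tag twice per document
    for my $tag ( @{ $tags_for_doc{$doc_id} } ) {
        $doc_freq{$tag}++ unless $seen{$tag}++;
    }
}

# Print the most frequent tags first, breaking ties alphabetically.
for my $tag ( sort { $doc_freq{$b} <=> $doc_freq{$a} || $a cmp $b } keys %doc_freq ) {
    print "$tag $doc_freq{$tag}\n";
}
```

A term dictionary that already stores per-term document counts would make the counting loop unnecessary, which is the appeal of reading the counts straight from the index.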

Simon
Re: KinoSearch postings database
On Apr 14, 2007, at 7:08 AM, Simon Cozens wrote:

> Nice and fast indeed! It's just that...
>
> perl -MGlob -le 'print Glob->searcher->search(query => "tag:theology")->total_hits;
> print Glob->searcher->doc_freq(KinoSearch::Index::Term->new("tag", "theology"))'
> 130
> 0
>
> What am I doing wrong?

Looks like an Analyzer mismatch issue. I'll bet that the Analyzer
for the 'tag' field is an English PolyAnalyzer, so the text is being
stemmed. What do you get when you try 'theolog'?
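
To illustrate the suspected mismatch, here is a toy model in plain Perl. It is a sketch under stated assumptions, not KinoSearch internals: toy_analyze() is a hypothetical stand-in for a stemming analyzer (it only lowercases and strips a trailing "y"), and %doc_freq stands in for the term dictionary. The point is that a parsed query goes through the analyzer while a hand-built term does not.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a stemming analyzer. A real stemmer is far more
# involved; this rule only lowercases and strips a trailing "y", so that
# "theology" becomes "theolog".
sub toy_analyze {
    my ($text) = @_;
    my $term = lc $text;
    $term =~ s/y\z//;
    return $term;
}

# Toy term dictionary: analyzed term => number of documents containing it.
# Index 130 documents tagged "Theology"; only the stemmed form is stored.
my %doc_freq;
for my $tag_text ( ("Theology") x 130 ) {
    $doc_freq{ toy_analyze($tag_text) }++;
}

# A parsed query passes through the analyzer, so it finds the stemmed term.
my $via_query = $doc_freq{ toy_analyze("theology") } // 0;

# A hand-built term skips the analyzer, so the raw text is looked up as-is.
my $via_raw_term = $doc_freq{"theology"} // 0;

print "$via_query\n";       # prints 130
print "$via_raw_term\n";    # prints 0
```

If that is what is happening, looking up the stemmed form directly should return the same count the search reports.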

> This one would be useful for me eventually for building tag-clouds, but for the
> time being I can get away with having a "tag" table in the database and going
> through that. Ideally most of my database would move to KinoSearch though.

More on this in a second reply...

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/