Filtering TermDocs and TermEnum
Hi All,

I have been trying for quite some time to filter the results of a TermDocs
or TermEnum: for example, to create a TermDocs that only returns documents
which match some query, or a TermEnum that only lists terms which
come from documents which match some query. As you may suspect, I tried to
use the FilterTermDocs and FilterTermEnum classes to accomplish this, but
they don't appear to provide this specific functionality (honestly, I can't
seem to figure out what functionality these classes are supposed to
provide).

If anyone has any information on how I may accomplish my goal of applying
TermDocs and TermEnums to subsets of the index (i.e. only documents which
match some query) please let me know.

I apologize if the answer is trivial, but this has had me stumped for a
while.

Thanks,
Eric
Re: Filtering TermDocs and TermEnum
why ?
Re: Filtering TermDocs and TermEnum
To apply statistical tools to the words

For example, say you have a large collection of news articles and you want
to know which words are appearing more often than usual today...

You could do a TermEnum limited to documents that were indexed today, and
then TermEnums for each of the previous 10 days, to find a mean and a
standard deviation for each word. Using this information you could
find which word is the most standard deviations above its mean daily count
today, and get an idea of which words are relevant to today's active
stories.
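
As a rough sketch of the arithmetic only (plain Java, no Lucene calls): the
class and method names below are made up for illustration, and the per-day
counts are assumed to have been collected already by whatever filtered
TermEnum ends up working. It uses the population standard deviation, which
is just one reasonable choice.

```java
import java.util.Map;

class TermSpike {
    /** Number of standard deviations by which today's count differs from the historical mean. */
    static double zScore(int[] history, int today) {
        if (history.length == 0) {
            throw new IllegalArgumentException("need at least one day of history");
        }
        double sum = 0;
        for (int count : history) {
            sum += count;
        }
        double mean = sum / history.length;

        double squares = 0;
        for (int count : history) {
            squares += (count - mean) * (count - mean);
        }
        double std = Math.sqrt(squares / history.length); // population std dev

        if (std == 0.0) {
            // Flat history: any change at all is infinitely unusual.
            return today == mean ? 0.0 : Math.signum(today - mean) * Double.POSITIVE_INFINITY;
        }
        return (today - mean) / std;
    }

    /** Term with the highest z-score; terms missing from today's counts are treated as 0. */
    static String mostUnusual(Map<String, int[]> history, Map<String, Integer> today) {
        String best = null;
        double bestZ = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, int[]> e : history.entrySet()) {
            double z = zScore(e.getValue(), today.getOrDefault(e.getKey(), 0));
            if (best == null || z > bestZ) {
                best = e.getKey();
                bestZ = z;
            }
        }
        return best; // null when there are no terms
    }
}
```

With 10 days of history per word, history.get(word) would hold 10 counts and
today.get(word) the count from the TermEnum limited to today's documents.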

Or say you wanted to see which words in your corpus of news articles were
related to the word 'foo'...

You could find the frequency of every word counting only documents
which match some TermQuery (like "contents:foo"), then compare these
frequencies to the overall frequencies of every term in the index to find out
how strongly each term is associated with foo.
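
One simple way to do that comparison, as a sketch: divide the rate at which
a term occurs in the matching documents by the rate at which it occurs in
the whole index. That ratio (sometimes called "lift") is my own choice of
measure here, not something Lucene provides, and the class and parameter
names are hypothetical.

```java
class TermLift {
    /**
     * Ratio of a term's document rate inside the subset to its rate in the whole index.
     * Values well above 1 suggest the term is over-represented in the subset.
     */
    static double lift(int dfInSubset, int subsetSize, int dfInIndex, int indexSize) {
        if (subsetSize <= 0 || indexSize <= 0 || dfInIndex <= 0) {
            throw new IllegalArgumentException("sizes and overall docFreq must be positive");
        }
        double subsetRate = (double) dfInSubset / subsetSize;  // share of matching docs with the term
        double overallRate = (double) dfInIndex / indexSize;   // share of all docs with the term
        return subsetRate / overallRate;
    }
}
```

For example, a term found in 30 of the 100 documents matching
"contents:foo" but in only 60 of 1000 documents overall occurs about five
times as often alongside foo as it does in general.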



On 12/27/05, Phoenix <biansutao@gmail.com> wrote:
>
> why ?
>
Re: Filtering TermDocs and TermEnum
: suspect I tried to use the FilterTermDocs and FilterTermEnum classes to
: accomplish this but they don't appear to provide this specific
: functionality (honestly I can't seem to figure out what functionality
: these classes are supposed to provide).

If I remember correctly, those classes are provided as thin wrapper base
classes you may use to implement whatever filtering you want around an
existing IndexReader. I'm not sure why they aren't abstract.

: If anyone has any information on how I may accomplish my goal of
: applying TermDocs and TermEnums to subsets of the index (i.e. only
: documents which match some query) please let me know.

TermDocs is easy: use a HitCollector (or wrap your query in a QueryFilter)
so you can get a BitSet representing each doc that matches your query. Then
look up each doc returned by your underlying TermDocs to decide if you want
to expose it or not.
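
A minimal sketch of that wrapper idea, written against a stand-in interface
so it runs without Lucene on the classpath. DocIterator, ArrayDocIterator
and FilteredDocIterator are hypothetical names; DocIterator mimics only the
next()/doc() part of TermDocs, and the BitSet is assumed to have been filled
in already from the query.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for the next()/doc() part of a TermDocs. */
interface DocIterator {
    boolean next(); // advance; returns false when exhausted
    int doc();      // current document number, valid after next() returned true
}

/** Walks a sorted array of document numbers (stand-in for real postings). */
class ArrayDocIterator implements DocIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIterator(int[] docs) {
        this.docs = docs;
    }

    public boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    public int doc() {
        return docs[pos];
    }
}

/** Exposes only those documents of the wrapped iterator whose bit is set. */
class FilteredDocIterator implements DocIterator {
    private final DocIterator in;
    private final BitSet allowed;

    FilteredDocIterator(DocIterator in, BitSet allowed) {
        this.in = in;
        this.allowed = allowed;
    }

    public boolean next() {
        // Skip over documents that did not match the query.
        while (in.next()) {
            if (allowed.get(in.doc())) {
                return true;
            }
        }
        return false;
    }

    public int doc() {
        return in.doc();
    }

    /** Convenience for demos: drain an iterator into a list. */
    static List<Integer> collect(DocIterator it) {
        List<Integer> out = new ArrayList<>();
        while (it.next()) {
            out.add(it.doc());
        }
        return out;
    }
}
```

In Lucene itself I believe the equivalent would be a subclass of
FilterTermDocs that overrides next() (read() and skipTo() would need the
same check), with the BitSet coming from something like QueryFilter's
bits(reader), but treat that as an untested assumption.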

As for a TermEnum ... I can't think of a straightforward way beyond
getting a TermDocs (that you've already filtered) for each Term and summing
up your own docFreq for each term.
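
Roughly, that brute-force approach looks like the sketch below. To keep it
runnable without Lucene, the index is modelled as a map from term text to an
array of document numbers; that map, the class name and the method name are
all stand-ins, not Lucene API.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

class FilteredDocFreq {
    /**
     * For each term, counts the documents in its posting list whose bit is set.
     * Terms with no matching documents are left out, which is what a filtered
     * term enumeration would be expected to do.
     */
    static Map<String, Integer> docFreqs(Map<String, int[]> postings, BitSet allowed) {
        Map<String, Integer> result = new TreeMap<>(); // sorted, like a term enumeration
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            int count = 0;
            for (int doc : e.getValue()) {
                if (allowed.get(doc)) {
                    count++;
                }
            }
            if (count > 0) {
                result.put(e.getKey(), count);
            }
        }
        return result;
    }
}
```

Note that this touches every posting of every term, so the cost grows with
the size of the whole index rather than with the size of the matching
subset.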

I suspect there must be a better/easier way.


-Hoss