Mailing List Archive

extracting information from an index
I am trying to construct a term-term correlation matrix from the data
stored in the index, for an extension to the vector model that I am
researching. In case my terminology is unfamiliar, what I need in order
to do this is, for each term t, a list of those documents which contain t
(also having a record of the number of times that t occurs in each would
be a nice bonus).

From this I can calculate the rest of what I need (number of times that
terms t1 and t2 occur in the same document, etc.).

If necessary I could squeeze by with just knowing the number of documents
in which t1, t2, and the combination (t1 AND t2) appear, but having the
above information from which to work would give me more flexibility.


Anyway, if there is a straightforward way of doing this that I have not
yet spotted, I'd like to know what it is; if not, pointing me at the
appropriate chunks of the source to start hacking on would also be
appreciated.

Thanks in advance for any help that may be offered.

Regards,

Joshua O'Madadhain (Madden)

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: extracting information from an index [ In reply to ]
> From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
>
> I am trying to construct a term-term correlation matrix from the data
> stored in the index, for an extension to the vector model that I am
> researching. In case my terminology is unfamiliar, what I
> need in order
> to do this is, for each term t, a list of those documents
> which contain t
> (also having a record of the number of times that t occurs in
> each would
> be a nice bonus).

Sounds like IndexReader.termDocs().

Doug

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: extracting information from an index [ In reply to ]
Please don't email me directly. Keep Lucene questions on Lucene lists.

> From: Winton Davies [mailto:wdavies@overture.com]
>
> Doug, I had the same question regarding enumerating the documents for
> accountIDs.
>
> I dont think I have any field that I can quickly guarantee that a
> term will work for -- is there anyway I can just do a for i=1 to max
> doc, get doc(i).accountID ?

If you have an accountId field, then you can enumerate all terms in that
field and, for each such term, enumerate all documents with that term. This
would look something like:

TermEnum enum = reader.terms(new Term("accountId", ""));
try {
while (enum.term().field().equals("accountId")) {
String accountId = enum.term().text();
TermDocs termDocs = reader.termDocs(enum.term());
try {
while (termDocs.next()) {
... do something with accountId and termDocs.doc() ...
}
} finally {
termDocs.close();
}
if (!enum.next()) {
break;
}
}
} finally {
enum.close();
}

Is that what you need?

Doug

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>