How does Lucene handle phrases containing words that are not indexed?
How does Lucene handle phrases (literals) containing words that are not
indexed (e.g. stopwords, one-letter words, numbers)? I ran some tests
(the Lucene demo, my own 120,000 XML documents, Cocoon search), and in all
cases a search for the phrase "a specification" also finds documents that
contain only "the specification". Likewise, a search for "G. Washington"
finds documents that contain "D. Washington".
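
To illustrate what I suspect is happening, here is a toy sketch in plain
Python (not Lucene code). It assumes the analyzer drops stopwords and
one-letter tokens before the query is built. The stopword list and the
analyze() function are hypothetical and only stand in for whatever the real
analyzer does.

```python
import re

# Hypothetical stopword list; the real analyzer's list may differ.
STOPWORDS = {"a", "an", "the", "of", "and"}

def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stopwords and one-letter tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

# Under these assumptions both phrases reduce to the same list of terms,
# so the index cannot tell them apart.
print(analyze("a specification") == analyze("the specification"))  # True
print(analyze("D. Washington") == analyze("G. Washington"))        # True
```

If the real analyzer behaves like this, the two phrase queries are identical
by the time they reach the index, which would explain the results above.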

Of course you can change the indexing behaviour so that no stopwords are
dropped and all one-letter words and numbers are indexed, but that seems
a bad approach. A better approach: 1) take all indexed words in the phrase
and find all documents containing these words; 2) check for the
occurrence of the phrase by opening the original document. I am wondering:
does Lucene perform step 2? Of course, this step burns some CPU cycles.
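
The two steps I have in mind can be sketched roughly as follows. This is
again plain Python over an in-memory dictionary, not Lucene's API. The names
build_index() and phrase_search(), the stopword list, and the sample
documents are all made up for illustration.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "and"}  # hypothetical list

def tokenize(text):
    """Lowercase and split on non-alphanumerics, keeping every token."""
    return re.findall(r"[a-z0-9]+", text.lower())

def is_indexed(token):
    """Assumed indexing rule: skip stopwords and one-letter tokens."""
    return token not in STOPWORDS and len(token) > 1

def build_index(docs):
    """Map each indexed term to the set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            if is_indexed(tok):
                index.setdefault(tok, set()).add(doc_id)
    return index

def phrase_search(phrase, docs, index):
    words = tokenize(phrase)
    indexed = [w for w in words if is_indexed(w)]
    # Step 1: candidates are documents containing every indexed word.
    if indexed:
        candidates = set.intersection(*(index.get(w, set()) for w in indexed))
    else:
        candidates = set(docs)  # nothing indexed: every document is a candidate
    # Step 2: open the original text and check for the exact token sequence.
    n = len(words)
    hits = []
    for doc_id in sorted(candidates):
        toks = tokenize(docs[doc_id])
        if any(toks[i:i + n] == words for i in range(len(toks) - n + 1)):
            hits.append(doc_id)
    return hits

docs = {
    1: "This is a specification of the format.",
    2: "Read the specification first.",
    3: "D. Washington was here.",
}
index = build_index(docs)
print(phrase_search("a specification", docs, index))  # [1]
print(phrase_search("G. Washington", docs, index))    # []
```

Step 1 alone would return documents 1 and 2 for "a specification"; only the
re-check against the original text in step 2 removes document 2. The cost of
step 2 grows with the number of candidates, which is the CPU concern
mentioned above.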

Hugo

hugob@xs4all.nl

