Mailing List Archive

context and hit positions with Lucene
Hi,

I've been lurking around the Lucene source code for about a week now...
There are a couple of things I can't work out how to do properly, and I'd be
grateful for any help with them.

I'm having a bit of trouble using hit positions in a test application, and
the results suggest I may need to contribute some code to Lucene for things
to work as I'd like.

At the moment, I'm doing something along the lines of the following, to
retrieve hit positions:

// Open an index and retrieve the hit positions object
IndexReader reader = IndexReader.open("index_file");
TermPositions hitPoints = reader.termPositions(new Term("contents", "metal"));
TermDocs docs = (TermDocs) hitPoints;

// While a document remains, loop
while (docs.next())
{
    out.print("Finding hit values for document <b>" + docs.doc() + "</b>");
    for (int j = 0; j < docs.freq(); j++)
    {
        // Output the hit position
        out.print(", " + hitPoints.nextPosition());
    }
    out.println("<br>");
}
reader.close();

I'm not able to do a great deal with that information at the moment. What
I'd really like to be able to do is get the relevant info in my actual
search results loop. So I'd call something like this:

while (search_results_remain) {
    Document doc = hits.doc(i);
    int[] documentHitPositions = doc.getHitPositions();
    // display fragments with 3 hits in the context text
    String someContextInfo = hits.getContextInfo(i, 3);
}

My main difficulties with the existing way of doing things are:
1) The call to termPositions() doesn't integrate with QueryParser.parse()
and that appears to be the only correct way to use complex queries such as
wildcards, booleans, etc.
Is there any way, given a query, to get the list of 'Term' objects that were
created for the query? This would help me to an extent as I'd be able to
generate complete hit positions, rather than just for an arbitrary term.
2) Retrieving the hit positions doesn't integrate with the 'Hits' or
Document objects, where it would be the most convenient, imho, (as in my
example, above). Is it feasible to integrate such functionality?

Showing some amount of context for each search result is something that my
company considers to be really important for adopting any search engine.
Could anyone point me in the right direction for what changes, if any, need
to be made to facilitate such a thing? If so, I may well be allowed to
contribute to Lucene on company time. From browsing the source and the
documentation, it appears that various things are in place to facilitate
implementing context information; I'm just not sure where exactly to
start...

Regards,

Lee Mallabone
Granta Design Ltd.
RE: context and hit positions with Lucene
Please see also Maik Schreiber's message on this topic:

http://www.geocrawler.org/archives/3/2624/2001/9/50/6553088/

The approach is to re-tokenize hit documents, scanning for query terms. The
index does not store the byte-position of words in the original document.
Only the tokenizer has that information. The index only stores the ordinal
position, e.g., that a term was the twelfth term in a document, while the
tokenizer can tell you, e.g., that a term occurs between bytes 291 and 301
in the text, which is what you need for highlighting.

Perhaps we should add a utility method such as:

public static Set getHitTokens(Set queryTerms, Reader text, Analyzer a)
    throws IOException {
  TokenStream ts = a.tokenStream(text);
  Set hitTokens = new HashSet();
  for (Token token = ts.next(); token != null; token = ts.next()) {
    if (queryTerms.contains(token.termText())) {
      hitTokens.add(token);
    }
  }
  return hitTokens;
}

(I have not tested this code.)
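
The difference between ordinal positions and text offsets can be sketched
without Lucene at all. The following is a minimal illustration in plain Java,
not a proposed Lucene API: the class OffsetScan, the Hit holder and the
regex-based word splitting are all assumptions made for the example, and it
reports character offsets rather than byte offsets. A real version would take
its offsets from the Analyzer's tokens instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
final class OffsetScan {

    /** One occurrence of a query term in the scanned text. */
    static final class Hit {
        final String term;   // lower-cased term text
        final int position;  // ordinal word position (0-based), the kind of value an index stores
        final int start;     // character offset of the first character (inclusive)
        final int end;       // character offset just past the last character (exclusive)

        Hit(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    // Assumption: a "word" is a run of ASCII letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    /** Re-scans the text and records every word that matches a (lower-case) query term. */
    static List<Hit> findHits(Set<String> queryTerms, String text) {
        List<Hit> hits = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        int position = 0;
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            if (queryTerms.contains(term)) {
                hits.add(new Hit(term, position, m.start(), m.end()));
            }
            position++; // every word advances the ordinal position, hit or not
        }
        return hits;
    }
}
```

Each Hit carries both kinds of position, so the offsets can drive highlighting
while the ordinal position stays comparable with what termPositions() returns.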

What class would we add this to? If we add it to Query then it could take a
Query instead of a Set. As Maik points out, there is currently no public
method that returns the set of terms in a query. That should probably be
added in any case.

Doug

> -----Original Message-----
> From: Lee Mallabone [mailto:lee@grantadesign.com]
> Sent: Thursday, October 04, 2001 9:00 AM
> To: lucene-dev@jakarta.apache.org
> Subject: context and hit positions with Lucene
>
>
> ..snip..
Re: context and hit positions with Lucene
Doug Cutting wrote:
> Please see also Maik Schreiber's message on this topic:
>
> http://www.geocrawler.org/archives/3/2624/2001/9/50/6553088/

Great! Thanks, that's a real help.

> The
> index does not store the byte-position of words in the original document.

Does that rule out the potential to implement proximity operators? I need to
implement NEAR (and then SAME for paragraph searches), but I'm a novice in
terms of search engine implementations. Am I likely to be out of my depth
attempting that right now with Lucene?

> Perhaps we should add a utility method such as:
>
> public static Set getHitTokens(Set queryTerms, Reader text, Analyzer a)
..snip..
> What class would we add this to? If we add it to Query then it could take a
> Query instead of a Set. As Maik points out, there is currently no public
> method that returns the set of terms in a query. That should probably be
> added in any case.

As you suggest, I think taking a Query rather than a Set would be the most
convenient.

This looks good, but what about the (future) case where you have complex
(possibly nested) proximity searches and only want to highlight the relevant
tokens when they appear near each other?

My company really likes Lucene, but we have a customer with *very* stringent
search requirements so I'm trying to determine if we can implement all of
them with or on top of Lucene.

Regards,

Lee Mallabone.
RE: context and hit positions with Lucene
> From: Lee Mallabone [mailto:lee@grantadesign.com]
>
> > The
> > index does not store the byte-position of words in the original document.
>
> Does that rule out the potential to implement proximity operators? I need to
> implement NEAR (and then SAME for paragraph searches), but I'm a novice in
> terms of search engine implementations. Am I likely to be out of my depth
> attempting that right now with Lucene?

Lucene does not directly support paragraph-based searching.

Lucene does support proximity searches, e.g., exact phrases, and within-N
words (slop). Please see the documentation for PhraseQuery, especially the
setSlop(int) method:

http://jakarta.apache.org/lucene/api/org/apache/lucene/search/PhraseQuery.html

Phrase slop is thus essentially WITHIN. The QueryParser class does not yet
have a syntax to specify slop.
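
To show why ordinal positions are enough for proximity tests, here is a
minimal sketch in plain Java. It is a deliberately simplified, unordered
"within N words" check over two sorted position lists, and it is not how
PhraseQuery itself evaluates slop; the class Proximity and the method near()
are hypothetical names used only for this example.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class Proximity {

    /**
     * Returns true if some occurrence of term A lies within maxDistance
     * word positions of some occurrence of term B, in either order.
     * Both arrays hold ordinal word positions sorted in ascending order.
     */
    static boolean near(int[] positionsA, int[] positionsB, int maxDistance) {
        int i = 0;
        int j = 0;
        while (i < positionsA.length && j < positionsB.length) {
            int gap = Math.abs(positionsA[i] - positionsB[j]);
            if (gap <= maxDistance) {
                return true;
            }
            // Advance whichever cursor is behind; moving the other one
            // could only widen the gap.
            if (positionsA[i] < positionsB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return false;
    }
}
```

For example, with term A at positions {2, 40} and term B at {10, 43}, a
distance of 3 matches (40 and 43) while a distance of 2 does not.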

> > Perhaps we should add a utility method such as:
> >
> > public static Set getHitTokens(Set queryTerms, Reader text, Analyzer a)

> This looks good, but what about the (future) case where you have complex
> (possibly nested) proximity searches and only want to highlight the relevant
> tokens when they appear near each other?

As you point out, the method I suggest would highlight isolated occurrences
of terms from query phrases in hit documents, even when they do not occur in
phrases. (Note that for the document to be a hit, they will somewhere also
occur together in a phrase, and possibly quite frequently for a high-scoring
hit.) Google and most other search engines implement term highlighting this
way, and I think it is acceptable. One could of course write a
TokenStream-based query evaluator that correctly interpreted phrasal
restrictions when highlighting. Personally, I do not think it is worth the
effort, so I am not volunteering to do it myself.
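
For concreteness, the simple highlighting behaviour described above (every
occurrence of a query term is marked, whether or not it sits inside a matching
phrase) might look like the following plain-Java sketch. The class
SimpleHighlighter, the <b> markup and the regex-based word splitting are
assumptions for the example only; it does no HTML escaping and uses no Lucene
classes.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
final class SimpleHighlighter {

    // Assumption: a "word" is a run of ASCII letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    /** Wraps every word that matches a (lower-case) query term in <b>...</b>. */
    static String highlight(Set<String> queryTerms, String text) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(text);
        int last = 0; // end offset of the text copied so far
        while (m.find()) {
            if (queryTerms.contains(m.group().toLowerCase(Locale.ROOT))) {
                out.append(text, last, m.start());                  // text before the hit
                out.append("<b>").append(m.group()).append("</b>"); // the hit itself
                last = m.end();
            }
        }
        out.append(text, last, text.length()); // trailing text after the last hit
        return out.toString();
    }
}
```

Isolated terms are highlighted even when the query was a phrase, which is
exactly the trade-off discussed above.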

Doug
Re: context and hit positions with Lucene
Doug Cutting wrote:

> Phrase slop is thus essentially WITHIN. The QueryParser class does not yet
> have a syntax to specify slop.

Are there any short-term plans to extend QueryParser to handle this?

> One could of course write a
> TokenStream-based query evaluator that correctly interpreted phrasal
> restrictions when highlighting. Personally, I do not think it is worth the
> effort, so I am not volunteering to do it myself.

I'd be very inclined to agree with you. :)

Cheers,

Lee Mallabone.