Mailing List Archive

Context specific summary with the search term
Hi

We are trying to implement Lucene, and one of the requirements for the
search is to provide a context within which the search term appears in a
document. So we need something like the summary in the demo that comes with
Lucene, except it needs to contain the context within which the search term
is found (e.g. 50 words around the search term). The summary in the demo is
the first X-number of characters in the HTML file.

I have a feeling we should use a Field object for that, but I am not very
clear on how to bind a particular search term to this Field object. This
"context" would be the context of the first occurrence of the search term
in the document.
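
For illustration, a minimal sketch of the "N words around the first occurrence" idea on plain text. It assumes the document text is already available as a String. The class name WordWindow and the method around are hypothetical and not part of Lucene, and the matching is a naive case-insensitive comparison rather than whatever an analyzer would do.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene.
class WordWindow {
    /**
     * Returns up to `radius` words on each side of the first word equal to
     * `term` (ignoring case and surrounding ASCII punctuation), or null if
     * the term does not occur.
     */
    static String around(String text, String term, int radius) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            // Strip leading and trailing non-word characters before comparing.
            String bare = words[i].replaceAll("^\\W+|\\W+$", "");
            if (bare.equalsIgnoreCase(term)) {
                int from = Math.max(0, i - radius);
                int to = Math.min(words.length, i + radius + 1);
                return String.join(" ", Arrays.copyOfRange(words, from, to));
            }
        }
        return null;
    }
}
```

Calling WordWindow.around(text, "fox", 50) would give the 50 words on either side of the first "fox", clipped at the start and end of the text.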

The TermDocs interface would be an ideal candidate if it provided something
like <document, frequency, sample_context>.

Has anyone done something similar? Any help would be appreciated.

Best regards

Benjamin Kopic
System Architect

Interactive1
132-140 Goswell Road
London EC1V 7DY
UK
Tel: +44 (0) 207 490 5773
Fax: +44 (0) 207 251 0817
www.interactive1.com
Re: Context specific summary with the search term
On Thu, 2001-10-18 at 17:29, Benjamin Kopic wrote:
> We are trying to implement Lucene, and one of the requirements for the
> search is to provide a context within which the search term appears in a
> document.
> Has anyone done something similar? Any help would be appreciated.

Hi,

This is something I also need to implement in the very near future. My
current thoughts are to use a variant of Maik Schreiber's way of doing
term highlighting in documents. See:
http://www.iq-computing.de/lucene/highlight.htm

Rather than highlight terms, I would just extract the first hit token,
and a certain number of characters either side of it.

This may not be the best approach, but it looks like the easiest method
to get working. I'm also not sure how realistic it will be from a
performance perspective, so if people have any alternative ideas, I'd be
happy to collaborate on an implementation...
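
A rough sketch of the "first hit token plus some characters either side" idea, under stated assumptions: a simple regular expression stands in for the re-tokenizing step (in a real application that would be whatever analyzer was used at index time), query terms are assumed to be lower-cased already, and the names FirstHitContext and extract are hypothetical rather than anything from Lucene or the highlighter linked above.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
class FirstHitContext {
    // Stand-in tokenizer: runs of ASCII letters and digits.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+");

    /**
     * Re-tokenizes `text` and returns the first token found in `queryTerms`
     * together with up to `chars` characters on each side, or null if no
     * token matches. `queryTerms` must contain lower-case terms.
     */
    static String extract(String text, Set<String> queryTerms, int chars) {
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (queryTerms.contains(token)) {
                // Token offsets give the window directly; clip to the text bounds.
                int from = Math.max(0, m.start() - chars);
                int to = Math.min(text.length(), m.end() + chars);
                return text.substring(from, to);
            }
        }
        return null;
    }
}
```

Because the scan stops at the first hit, the cost per document is bounded by the position of that hit, which matters only for the handful of results shown on a page.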

Regards,

--
Lee Mallabone
RE: Context specific summary with the search term
> From: Lee Mallabone [mailto:lee@grantadesign.com]
>
> This is something I also need to implement in the very near future. My
> current thoughts are to use a variant of Maik Schreiber's way of doing
> term highlighting in documents. See:
> http://www.iq-computing.de/lucene/highlight.htm
>
> Rather than highlight terms, I would just extract the first hit token,
> and a certain number of characters either side of it.
>
> This may not be the best approach, but it looks like the easiest method
> to get working. I'm also not sure how realistic it will be from a
> performance perspective, so if people have any alternative ideas, I'd be
> happy to collaborate on an implementation...

I think this is the best approach. Since you'll probably only be displaying
around ten hits at a time, the cost of re-tokenizing is fairly small.
Please consider contributing your code when it is complete.

Doug
RE: Context specific summary with the search term
> From: Lee Mallabone [mailto:lee@grantadesign.com]
>
> I'm trying to implement this and should be able to contribute any
> successful results, but I need to produce context on a per-field basis.
> E.g. if I got a token hit in the text body of a document, but the first
> hit token was a word in the section title, I'd want to generate context
> around the token in the text body.

How did the title ever get indexed as the title? Presumably you split the
document into fields when it was indexed. Similarly, if you re-tokenize
things a field at a time then you should always know which field you are in,
no?

> I had been using a TokenStream to try this. However, Lucene's Token
> class doesn't seem to have any concept of fields (even when I
> tokenStream() a document that is in the index with a whole bunch of
> fields). Is there any reason for this? Moreover, any suggestions of how
> to find the information I need?
>
> The natural thing seems to be to have a field-aware token stream, but
> I'm not sure how I'd go about implementing that...
>
> Regards,
>
> --
> Lee Mallabone
>
RE: Context specific summary with the search term
On Fri, 2001-10-19 at 17:01, Doug Cutting wrote:
> > Rather than highlight terms, I would just extract the first hit token,
> > and a certain number of characters either side of it.
>
> I think this is the best approach. Since you'll probably only be displaying
> around ten hits at a time, the cost of re-tokenizing is fairly small.
> Please consider contributing your code when it is complete.

I'm trying to implement this and should be able to contribute any
successful results, but I need to produce context on a per-field basis.
E.g. if I got a token hit in the text body of a document, but the first
hit token was a word in the section title, I'd want to generate context
around the token in the text body.

I had been using a TokenStream to try this. However, Lucene's Token
class doesn't seem to have any concept of fields (even when I
tokenStream() a document that is in the index with a whole bunch of
fields). Is there any reason for this? Moreover, any suggestions of how
to find the information I need?

The natural thing seems to be to have a field-aware token stream, but
I'm not sure how I'd go about implementing that...

Regards,

--
Lee Mallabone
RE: Context specific summary with the search term
On Mon, 2001-10-22 at 17:43, Doug Cutting wrote:
> > I'm trying to implement this and should be able to contribute any
> > successful results, but I need to produce context on a per-field basis.
>
> How did the title ever get indexed as the title? Presumably you split the
> document into fields when it was indexed. Similarly, if you re-tokenize
> things a field at a time then you should always know which field you are in,
> no?

I'm indexing HTML documents marked up with comments to indicate field
boundaries. So I'd typically have:

<!--field:section_title-->
blurb
<!--field:text-->
more blurb

and so on. The documents were indexed by looking for each field marker
and then adding the subsequent lines to the relevant field.
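
To make the marker scheme concrete, here is a minimal sketch of that kind of field splitting. It assumes each marker sits on a line of its own in exactly the form shown above, and that text before the first marker is ignored. The class name FieldSplitter and its split method are hypothetical, for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
class FieldSplitter {
    // Matches a whole line of the form <!--field:NAME-->
    private static final Pattern MARKER = Pattern.compile("<!--field:(\\w+)-->");

    /** Maps each field name to the lines that follow its marker, in document order. */
    static Map<String, String> split(String document) {
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        StringBuilder current = null; // buffer of the field being read, if any
        for (String line : document.split("\\r?\\n")) {
            Matcher m = MARKER.matcher(line.trim());
            if (m.matches()) {
                // A marker switches the current field; a repeated name reuses its buffer.
                current = buffers.computeIfAbsent(m.group(1), k -> new StringBuilder());
            } else if (current != null) {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            fields.put(e.getKey(), e.getValue().toString());
        }
        return fields;
    }
}
```

The same splitting step can serve both indexing and context generation, so the text handed to the context code is always known to belong to one field.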

In order to obtain a generic solution for context generation, are you
suggesting I write a method that takes plain text (e.g. the text form of
the document) and a query, and assumes the plain text is in the query's
default field?

This doesn't seem quite as useful as getContext(HashSet queryTerms,
Reader originalDocument), which is what I was originally aiming towards.

Regards,

--
Lee Mallabone
RE: Context specific summary with the search term
> From: Lee Mallabone [mailto:lee@grantadesign.com]
> >
> > How did the title ever get indexed as the title?
>
> I'm indexing HTML documents marked up with comments to indicate field
> boundaries. So I'd typically have:
>
> <!--field:section_title-->
> blurb
> <!--field:text-->
> more blurb
>
> and so on. The documents were indexed by looking for each field marker
> and then adding the subsequent lines to the relevant field.
>
> In order to obtain a generic solution for context generation

If you're doing application-specific processing to extract fields from
documents, then a completely generic solution for extracting hit context
from documents is, by definition, impossible, since context extraction
requires field extraction.

> are you
> suggesting I write a method that takes plain text (e.g. the text form of
> the document) and a query, and assumes the plain text is in the query's
> default field?

I'm not exactly sure what you're proposing here, but, no, it doesn't sound
like something that I have suggested.

> This doesn't seem quite as useful as getContext(HashSet queryTerms,
> Reader originalDocument), which is what I was originally aiming towards.

Such a method is easy to define if the Reader contains text from a single
field. (Although you should probably pass in an Analyzer too.) However, if
you're expecting such a method to automatically divide the text into fields,
then things will be harder, since Lucene's model is that applications divide
documents into fields. So you could write an application-specific version
that divides fields automatically, or, to use more generic code, you could
call such a generic method once for each field of your document, leaving
field extraction in application-specific code. Does that make sense?
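
One way to picture that split between generic and application-specific code is the sketch below. It is only an illustration under assumptions: the application has already divided the document into fields and knows which query terms apply to which field, a regular expression stands in for the Analyzer, and the names PerFieldContext, getContext and contexts are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
class PerFieldContext {
    // Stand-in for an Analyzer: runs of ASCII letters and digits, lower-cased.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+");

    /** Generic part: context around the first hit in the text of ONE field. */
    static String getContext(Set<String> queryTerms, Reader fieldText, int chars) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            for (int n; (n = fieldText.read(buf)) != -1; ) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String text = sb.toString();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (queryTerms.contains(m.group().toLowerCase(Locale.ROOT))) {
                int from = Math.max(0, m.start() - chars);
                int to = Math.min(text.length(), m.end() + chars);
                return text.substring(from, to);
            }
        }
        return null;
    }

    /** Application-specific part: call the generic method once per field. */
    static Map<String, String> contexts(Map<String, String> fields,
                                        Map<String, Set<String>> termsByField,
                                        int chars) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            Set<String> terms = termsByField.get(field.getKey());
            if (terms == null) {
                continue; // the query has no terms for this field
            }
            String ctx = getContext(terms, new StringReader(field.getValue()), chars);
            if (ctx != null) {
                out.put(field.getKey(), ctx);
            }
        }
        return out;
    }
}
```

With this shape, a term that the query targets at the text field never produces context from the section title, because the title is simply never scanned for it.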

Doug
RE: Context specific summary with the search term
On Tue, 2001-10-23 at 17:48, Doug Cutting wrote:
> > This doesn't seem quite as useful as getContext(HashSet queryTerms,
> > Reader originalDocument), which is what I was originally aiming towards.
> to use more generic code, you could
> call such a generic method once for each field of your document, leaving
> field extraction in application-specific code. Does that make sense?

Okay, I'm now not entirely certain how useful a generic solution will be
to me, given the non-generic nature of the content I'm indexing. I think
there are a lot of optimizations I can make that wouldn't be generic.

I'll play around and see what I come up with - if anything turns out to
look sufficiently generic that it might be useful, I'll post it to the
list.

Thanks for your thoughts,

--
Lee Mallabone
Re: Context specific summary with the search term
Lee Mallabone wrote:
> Okay, I'm now not entirely certain how useful a generic solution will be
> to me, given the non-generic nature of the content I'm indexing. I think
> there are a lot of optimizations I can make that wouldn't be generic.

"Premature optimization is the root of all evil."

Seriously, though, one thing I see Doug say often is that
lucene's indexing and searching are designed to be extremely fast. He
often responds to questions about odd details - for example, the
classic "do a search and cache the search results for paging across
multiple web pages" - by saying to just use the brute force approach
and rely on the speed of the lucene index.
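
As a small illustration of that brute-force paging idea: rather than caching results between requests, each page request simply runs the search again and takes the slice it needs. The sketch assumes the search is exposed as a Supplier returning hits in ranked order. The names BruteForcePaging and page are hypothetical and nothing here is Lucene API.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of Lucene.
class BruteForcePaging {
    /**
     * Re-runs `search` and returns the hits for the zero-based `pageIndex`,
     * with `pageSize` hits per page. Returns an empty list past the last page.
     */
    static <T> List<T> page(Supplier<List<T>> search, int pageIndex, int pageSize) {
        List<T> hits = search.get(); // run the query again on every request
        int from = Math.min(pageIndex * pageSize, hits.size());
        int to = Math.min(from + pageSize, hits.size());
        return hits.subList(from, to);
    }
}
```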

I like to assume that there are people out there with a
lot more on the ball than me about things like optimization. I try to
use their brains as much as possible :-). For example, with
compilers, I assume the compiler writer knew a lot more about
optimization than I do. People talk about the compiler not having the
human judgement to know what's best. That's true, but the way to deal
with that is not to try to hand-optimize my code and outguess the
compiler (which will only confuse the compiler and prevent
it from doing what it was designed to do). The compiler can best
optimize the program if I focus first on making my intent clear,
what the program is meant to do, in the structure of the code.

This leads to another optimization slogan that I remember reading
- algorithmic optimization is much better than spot optimization. In
other words, before you try to figure out a faster way to do
something, figure out if you're doing the thing that accomplishes your
true goal in the fastest way. And figure out how important that thing
is in the grand scheme of things.

Steven J. Owens
puff@darksleep.com