Mailing List Archive: Excerpt pondering

Hi all

A couple of days ago I posted a message to the user list asking about some
of the API hooks that appear to be for excerpt generation, and I haven't
heard anything back yet.

I'd like some feedback on an idea I have for extending Lucene to hold the
extra information needed to generate excerpts, so that I don't have to
reparse the entire body text.

Basically, to work out which sections of the text have the terms that
generate the hit most frequently, I need the position of the terms in the
document. This info, AFAICS, is already stored, but isn't accessible to
someone from a Hits object. It would be nice to make it available somehow.
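
As a rough illustration of the calculation this would enable (a sketch only,
written in Python for brevity; densest_window and its parameters are
hypothetical names, and nothing here is Lucene API): given the token
positions at which the query terms occur in one document, a sliding window
finds the region with the most hits.

```python
def densest_window(positions, width):
    """Return (start_position, hit_count) for the span of `width` token
    positions that contains the most query-term hits.

    `positions` is assumed to be the token positions of all query-term
    occurrences in a single document (hypothetical input, in any order).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    positions = sorted(positions)
    best_start, best_count = None, 0
    left = 0
    for right, pos in enumerate(positions):
        # Advance the left edge until the window [positions[left], pos]
        # spans fewer than `width` positions.
        while pos - positions[left] >= width:
            left += 1
        count = right - left + 1
        if count > best_count:
            best_start, best_count = positions[left], count
    return best_start, best_count
```

With hits at positions 3, 40, 42, 45 and 90 and a window of 10 positions,
the densest region starts at position 40 and holds three hits.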

Also, to be able to work out where those terms were in the original
document, I'd like to store, and be able to retrieve, the start and end
offset in the original field, for each term. This info is currently attached
to the Token object, but AFAICS is not stored. Whether the best place to do
that would be an extension to the existing segments, or in a separate
segment file, I'm not sure. I haven't really spent enough time looking at
the mechanics of the files yet.

I'd really appreciate it if someone who understands how things work
underneath could say "That sounds great, but try it like this" or "Don't do
anything, we're currently implementing something similar" or even "You
idiot, look at http://xyz/ to do that".

Thanks

Tom

--

Tom Dunstan

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

Tom Dunstan wrote:
> I'd like some feedback on an idea that I have to extend Lucene to hold the
> extra information that it needs to stop me having to reparse the entire body
> text again to generate excerpts.
>
> Basically, to work out which sections of the text have the terms that
> generate the hit most frequently, I need the position of the terms in the
> document. This info, AFAICS, is already stored, but isn't accessible to
> someone from a Hits object. It would be nice to make it available somehow.

That's not impossible, but would require a substantial re-working of the
search code, and would probably make search slower. Also, I'm not sure
how useful it would really be.

> Also, to be able to work out where those terms were in the original
> document, I'd like to store, and be able to retrieve, the start and end
> offset in the original field, for each term. This info is currently attached
> to the Token object, but AFAICS is not stored. Whether the best place to do
> that would be an extension to the existing segments, or in a separate
> segment file, I'm not sure. I haven't really spent enough time looking at
> the mechanics of the files yet.

This would greatly increase the size of the index, and it would be
difficult to make the stored offsets efficiently accessible at random.

However the primary rationale for not including this in the index is
that typically you only display ten or so documents. Re-tokenizing ten
documents should only take a fraction of a second, and thus can be
efficiently done at search time: there's no need to store the exact
positions in the text.
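
A minimal sketch of that search-time approach, assuming the stored text of
each displayed document is available as a string. It is written in Python
for brevity and does not use any Lucene API; tokenize, excerpt and max_chars
are hypothetical names, and the regex tokenizer merely stands in for
whatever analyzer built the index.

```python
import re


def tokenize(text):
    """Yield (term, start_offset, end_offset) for each word in `text`."""
    for m in re.finditer(r"\w+", text):
        yield m.group().lower(), m.start(), m.end()


def excerpt(text, query_terms, max_chars=60):
    """Re-tokenize `text` and return a slice of at most `max_chars`
    characters that covers as many query-term occurrences as possible."""
    wanted = {t.lower() for t in query_terms}
    # Character offsets of every query-term occurrence, in document order.
    hits = [(s, e) for term, s, e in tokenize(text) if term in wanted]
    if not hits:
        return text[:max_chars]  # no match: fall back to the opening text
    best_left, best_count, left = 0, 0, 0
    for right, (_, end) in enumerate(hits):
        # Drop hits from the left until the span fits within max_chars.
        while left < right and end - hits[left][0] > max_chars:
            left += 1
        if right - left + 1 > best_count:
            best_left, best_count = left, right - left + 1
    start = hits[best_left][0]
    return text[start:start + max_chars]
```

Calling excerpt() on each of the ten or so displayed documents costs one
tokenizing pass per document. The sketch cuts the excerpt at a raw character
boundary; a real implementation would probably trim to word or sentence
boundaries.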

Doug
