Mailing List Archive

summarizing & highlighting
Hello,
I implemented a summarizing & highlighting component that can be used to summarize longer texts for display on a result page. It's not well commented or documented, but maybe others can use it.

algorithm:
1. extract terms of query (needs Lucene modification: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00037.html)
2. tokenize the text and collect the set T of tokens that could be highlighted (the term in the token is a query term)
3. make fragments: a fragment is a pair of tokens from T (more formally, an element of TxT); it is the substring of the text from leftToken to rightToken (leftToken can be equal to rightToken)
4. sort the fragments by their weight (length of the fragment, how many tokens are in the fragment)
5. take the first N fragments, where N is less than a limit (maxFragments; default 3) and the length of the fragments is less than a limit (maxLen; default 300)
6. build the output string and highlight the tokens that are in T
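
For readers who want to experiment, steps 2 to 6 could be sketched roughly as follows. This is not the posted component, only a minimal illustration under several assumptions: the query terms arrive as a ready-made lowercase set (step 1 is skipped), a simple `\w+` regex stands in for a Lucene analyzer, the weight is taken to mean "more highlighted tokens first, shorter fragment on ties", maxLen is applied per fragment in characters, maxFragments is treated as an inclusive cap, overlapping fragments are skipped, and highlights are written as `<b>...</b>`. The names SummarizerSketch, Span and summarize are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of steps 2-6; not the posted component. */
final class SummarizerSketch {

    /** Character span [start, end) in the text, covering `hits` query-term tokens. */
    static final class Span {
        final int start, end, hits;
        Span(int start, int end, int hits) { this.start = start; this.end = end; this.hits = hits; }
        int length() { return end - start; }
    }

    static String summarize(String text, Set<String> queryTerms, int maxFragments, int maxLen) {
        // Step 2: tokenize and collect T, the tokens whose term is a query term.
        List<Span> t = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            if (queryTerms.contains(m.group().toLowerCase(Locale.ROOT))) {
                t.add(new Span(m.start(), m.end(), 1));
            }
        }

        // Step 3: a fragment is a pair (left, right) from T x T with left <= right.
        List<Span> fragments = new ArrayList<>();
        for (int i = 0; i < t.size(); i++) {
            for (int j = i; j < t.size(); j++) {
                Span f = new Span(t.get(i).start, t.get(j).end, j - i + 1);
                if (f.length() >= maxLen) break; // extending right only makes it longer
                fragments.add(f);
            }
        }

        // Step 4 (assumed weight): more highlighted tokens first, shorter span on ties.
        fragments.sort(Comparator.comparingInt((Span f) -> -f.hits).thenComparingInt(Span::length));

        // Step 5: take the best fragments, skipping overlaps (an added assumption).
        List<Span> chosen = new ArrayList<>();
        for (Span f : fragments) {
            if (chosen.size() == maxFragments) break;
            boolean overlaps = false;
            for (Span c : chosen) {
                if (f.start < c.end && c.start < f.end) { overlaps = true; break; }
            }
            if (!overlaps) chosen.add(f);
        }
        chosen.sort(Comparator.comparingInt((Span f) -> f.start)); // back to text order

        // Step 6: build the output, wrapping every token of T in <b>...</b>.
        StringBuilder out = new StringBuilder();
        for (Span f : chosen) {
            if (out.length() > 0) out.append(" ... ");
            int pos = f.start;
            for (Span tok : t) {
                if (tok.start < f.start || tok.end > f.end) continue;
                out.append(text, pos, tok.start)
                   .append("<b>").append(text, tok.start, tok.end).append("</b>");
                pos = tok.end;
            }
            out.append(text, pos, f.end);
        }
        return out.toString();
    }
}
```

Building every pair from T is quadratic in the number of matching tokens, which is probably fine for short texts but would need pruning for long documents with many hits.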

I tested it on relatively short texts. I know this is not a very good algorithm; I'm planning to improve it.

peter
Re: summarizing & highlighting
Hi Peter,

Thanks for this contribution.

Can we put your code under the Apache License (Mark already said that his
code could be used under the apache license)?

Thanks

--Peter


On 4/15/02 4:28 PM, "Halácsy Péter" <halacsy.peter@axelero.com> wrote:

> summarizing & highlighting component that can be used to summarize longer
> texts to present on result page. It's not well-commented/documented but maybe
> it can be used by others.


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: summarizing & highlighting
> -----Original Message-----
> From: Peter Carlson [mailto:carlson@bookandhammer.com]
> Sent: Tuesday, April 16, 2002 1:56 AM
> To: Lucene Users List
> Subject: Re: summarizing & highlighting
>
>
> Hi Peter,
>
> Thanks for this contribution.
>
> Can we put your code under the Apache License (Mark already
> said that his
> code could be used under the apache license)?

Of course.

peter
