Mailing List Archive: summarizing & highlighting

Apr 15, 2002, 4:28 PM

Post #1 of 3 (408 views)

Hello,
I implemented a summarizing & highlighting component that can be used to summarize longer texts for display on a result page. It isn't well commented or documented, but maybe others can use it.

Algorithm:
1. Extract the terms of the query (needs a Lucene modification: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00037.html)
2. Tokenize the text and collect the tokens that could be highlighted (the term in the token is a query term); call this set T
3. Make fragments: a fragment is a pair of tokens from T (more formally, an element of TxT); it is the substring of the text from leftToken to rightToken (leftToken can be equal to rightToken)
4. Sort the fragments by their weight (length of the fragment, how many tokens are in the fragment)
5. Take the first N fragments, where N is limited by maxFragments (default 3) and the length of the fragments is limited by maxLen (default 300)
6. Build the output string and highlight the tokens that are in T
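
The steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the component itself, and it rests on several assumptions: the class and method names are hypothetical, a simple \w+ regex stands in for a Lucene analyzer, the query terms are passed in already extracted and lowercased (step 1), a fragment's weight is taken to be the number of highlightable tokens it contains (ties go to the shorter fragment), maxLen is applied to each fragment separately, and fragments that overlap an already chosen one are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FragmentSummarizer {
    // Character offsets [start, end) of one highlightable token (a member of T).
    static final class Hit {
        final int start, end;
        Hit(int start, int end) { this.start = start; this.end = end; }
    }

    // A fragment runs from hit number 'first' to hit number 'last' (inclusive).
    static final class Fragment {
        final int first, last, length;
        Fragment(int first, int last, int length) {
            this.first = first; this.last = last; this.length = length;
        }
        int hitCount() { return last - first + 1; }
    }

    static String summarize(String text, Set<String> queryTerms,
                            int maxFragments, int maxLen) {
        // Step 2: tokenize and collect T. Assumption: \w+ replaces the analyzer.
        List<Hit> hits = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            if (queryTerms.contains(m.group().toLowerCase())) {
                hits.add(new Hit(m.start(), m.end()));
            }
        }

        // Step 3: every pair (left, right) of T with left <= right is a fragment.
        List<Fragment> fragments = new ArrayList<>();
        for (int i = 0; i < hits.size(); i++) {
            for (int j = i; j < hits.size(); j++) {
                int length = hits.get(j).end - hits.get(i).start;
                if (length > maxLen) break; // later right tokens only get longer
                fragments.add(new Fragment(i, j, length));
            }
        }

        // Step 4: sort by weight. Assumption: more hits first, then shorter first.
        fragments.sort((a, b) -> a.hitCount() != b.hitCount()
                ? Integer.compare(b.hitCount(), a.hitCount())
                : Integer.compare(a.length, b.length));

        // Step 5: take up to maxFragments fragments, skipping overlapping ones.
        List<Fragment> chosen = new ArrayList<>();
        for (Fragment f : fragments) {
            if (chosen.size() == maxFragments) break;
            boolean overlaps = false;
            for (Fragment c : chosen) {
                if (f.first <= c.last && c.first <= f.last) overlaps = true;
            }
            if (!overlaps) chosen.add(f);
        }
        chosen.sort((a, b) -> Integer.compare(a.first, b.first)); // text order

        // Step 6: build the output and wrap every token of T in <b>...</b>.
        StringBuilder out = new StringBuilder();
        for (Fragment f : chosen) {
            if (out.length() > 0) out.append(" ... ");
            int pos = hits.get(f.first).start;
            for (int k = f.first; k <= f.last; k++) {
                Hit h = hits.get(k);
                out.append(text, pos, h.start);
                out.append("<b>").append(text, h.start, h.end).append("</b>");
                pos = h.end;
            }
        }
        return out.toString();
    }
}
```

Because a fragment begins and ends on a query term, the summary carries no context before the first or after the last highlighted token. Building all pairs is quadratic in the number of matching tokens, which is fine for short texts but may need pruning for long ones.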

I tested it on relatively short texts. I know this is not a very good algorithm; I'm planning to improve it.

peter

Re: summarizing & highlighting [ In reply to ]

carlson at bookandhammer

Apr 15, 2002, 4:55 PM

Post #2 of 3 (407 views)

Hi Peter,

Thanks for this contribution.

Can we put your code under the Apache License (Mark already said that his
code could be used under the apache license)?

Thanks

--Peter

On 4/15/02 4:28 PM, "Halácsy Péter" <halacsy.peter@axelero.com> wrote:

> summarizing & highlighting component that can be used to summarize longer
> texts to present on result page. It's not well-commented/documented but maybe
> it can be used by others.

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: summarizing & highlighting [ In reply to ]

halacsy.peter at axelero

Apr 16, 2002, 12:04 AM

Post #3 of 3 (402 views)

> -----Original Message-----
> From: Peter Carlson [mailto:carlson@bookandhammer.com]
> Sent: Tuesday, April 16, 2002 1:56 AM
> To: Lucene Users List
> Subject: Re: summarizing & highlighting
>
>
> Hi Peter,
>
> Thanks for this contribution.
>
> Can we put your code under the Apache License (Mark already
> said that his
> code could be used under the apache license)?

Of course.

peter
