Hello,
I implemented a summarizing & highlighting component that can be used to summarize longer texts for presentation on a result page. It's not well commented or documented, but maybe others can use it.
algorithm:
1. extract the terms of the query (requires a modification to Lucene: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00037.html)
2. tokenize the text and collect the set T of tokens that could be highlighted (tokens whose term is a query term)
3. make fragments: a fragment is a pair of tokens from T (more formally, an element of TxT); it is the substring of the text from leftToken to rightToken (leftToken can be equal to rightToken)
4. sort the fragments by their weight (the length of the fragment and how many tokens are in the fragment)
5. take the first N fragments, where N is limited by maxFragments (default 3) and the length of the fragments is limited by maxLen (default 300)
6. build the output string and highlight the tokens that are in T
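
A rough sketch of steps 2-6 in plain Java (16+ for records), not the actual component. Several things here are assumptions made only for illustration: the class and method names, the `\w+` tokenizer standing in for a Lucene analyzer, the weight ordering (more highlightable tokens first, then shorter fragments), reading maxLen as a limit on the total length of the chosen fragments, skipping overlapping fragments, and the `<b>` markup. Step 1 is left out; the query terms are passed in as a set of lower-case strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SummarizerSketch {
    // A highlightable token: character offsets [start, end) in the text.
    record Token(int start, int end) {}

    // A fragment runs from token T[left] to token T[right] (left may equal right).
    record Fragment(int left, int right, int start, int end) {
        int hits() { return right - left + 1; }  // tokens of T inside the fragment
        int length() { return end - start; }     // fragment length in characters
    }

    static String summarize(String text, Set<String> queryTerms, int maxFragments, int maxLen) {
        // Step 2: tokenize and collect T (tokens whose lower-cased text is a query term).
        List<Token> t = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            if (queryTerms.contains(m.group().toLowerCase(Locale.ROOT))) {
                t.add(new Token(m.start(), m.end()));
            }
        }
        // Step 3: fragments are pairs (left, right) of T with left <= right.
        List<Fragment> fragments = new ArrayList<>();
        for (int i = 0; i < t.size(); i++) {
            for (int j = i; j < t.size(); j++) {
                Fragment f = new Fragment(i, j, t.get(i).start(), t.get(j).end());
                if (f.length() > maxLen) break;  // later pairs starting at i are even longer
                fragments.add(f);
            }
        }
        // Step 4: sort by weight (assumed: more hits first, then shorter fragments).
        fragments.sort((a, b) -> a.hits() != b.hits()
                ? Integer.compare(b.hits(), a.hits())
                : Integer.compare(a.length(), b.length()));
        // Step 5: take fragments in order, skipping overlaps, within both limits.
        List<Fragment> chosen = new ArrayList<>();
        int total = 0;
        for (Fragment f : fragments) {
            if (chosen.size() == maxFragments) break;
            boolean overlaps = false;
            for (Fragment c : chosen) {
                if (f.left() <= c.right() && c.left() <= f.right()) overlaps = true;
            }
            if (overlaps || total + f.length() > maxLen) continue;
            chosen.add(f);
            total += f.length();
        }
        // Step 6: emit the chosen fragments in text order, highlighting tokens of T.
        chosen.sort((a, b) -> Integer.compare(a.start(), b.start()));
        StringBuilder out = new StringBuilder();
        for (Fragment f : chosen) {
            if (out.length() > 0) out.append(" ... ");
            int pos = f.start();
            for (int k = f.left(); k <= f.right(); k++) {
                Token tok = t.get(k);
                out.append(text, pos, tok.start());  // plain text between highlighted tokens
                out.append("<b>").append(text, tok.start(), tok.end()).append("</b>");
                pos = tok.end();
            }
        }
        return out.toString();
    }
}
```

For example, `SummarizerSketch.summarize(text, Set.of("lucene", "search"), 3, 300)` uses the defaults from step 5. Because a fragment always starts and ends on a highlightable token, the output never includes context before the first or after the last hit; adding such context would be one possible improvement.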
I tested it on relatively short texts. I know the algorithm is not very good yet; I'm planning to improve it.
peter