Mailing List Archive

highlighting phrases
I am working on a modification to Lucene's highlighter. Currently all terms of
a phrase query are highlighted, even if they appear out of phrase context:
Searching for "Foo Bar" in "Foo Bar some stuff Foo" will result in
"_Foo_ _Bar_ some stuff _Foo_". It would be nicer to have
"_Foo_ _Bar_ some stuff Foo" as the result.

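The behaviour described above can be sketched in a few lines of plain Java. This is only an illustration, not the highlighter code itself. It assumes whitespace tokenization, exact case-sensitive term matching and no slop, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class PhraseHighlightSketch {

    // Wraps a token in underscores only when it belongs to a complete,
    // in-order occurrence of the phrase. Tokens are split on whitespace
    // (an assumption; a real Analyzer would be used in Lucene).
    static String highlight(String text, String phrase) {
        String[] tokens = text.trim().split("\\s+");
        String[] phraseTerms = phrase.trim().split("\\s+");
        boolean[] marked = new boolean[tokens.length];

        // Try every position at which the whole phrase could start.
        for (int start = 0; start + phraseTerms.length <= tokens.length; start++) {
            boolean match = true;
            for (int k = 0; k < phraseTerms.length; k++) {
                if (!tokens[start + k].equals(phraseTerms[k])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                Arrays.fill(marked, start, start + phraseTerms.length, true);
            }
        }

        // Rebuild the text, marking only tokens inside a phrase occurrence.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(marked[i] ? "_" + tokens[i] + "_" : tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: _Foo_ _Bar_ some stuff Foo
        System.out.println(highlight("Foo Bar some stuff Foo", "Foo Bar"));
    }
}
```

The trailing "Foo" stays unmarked because no "Bar" follows it.
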
I already implemented this behaviour in an older version of the highlighter,
where things were still simple. But now I see that there was a modification to
deal with overlapping tokens. These make the whole matter much more complicated.
But I guess that I will try to merge my old phrase highlighter code with the
current version of Lucene.

Is anybody working on this kind of phrase highlighting?
Would my modifications be of interest to you?

Best regards,
Guido Wegener

--
Guido Wegener
startext Unternehmensberatung GmbH
Kennedyallee 2, D-53175 Bonn
Tel: +49 (0)228 959 96-26, Fax: +49 (0)228 959 96-66
Internet: http://www.startext.de, E-Mail: gwe@startext.de



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: highlighting phrases [ In reply to ]
Guido, your modification sounds good. If you can contribute it, that
would be great.

Otis

--- Guido Wegener <gwe@startext.de> wrote:

> I am working on a modification to Lucene's highlighter. Currently all
> terms of
> a phrase query are highlighted, even if they appear out of phrase
> context:
> Searching for "Foo Bar" in "Foo Bar some stuff Foo" will result in
> "_Foo_ _Bar_ some stuff _Foo_". It would be nicer to have
> "_Foo_ _Bar_ some stuff Foo" as the result.
>
> I already implemented this behaviour in an older version of the
> highlighter,
> where things were still simple. But now I see that there was a
> modification to
> deal with overlapping tokens. These make the whole matter much more
> complicated.
> But I guess that I will try to merge my old phrase highlighter code
> with the
> current version of Lucene.
>
> Is anybody working on this kind of phrase highlighting?
> Would my modifications be of interest to you?
>
> Best regards,
> Guido Wegener
>


Re: highlighting phrases [ In reply to ]
Guido Wegener wrote:

> I am working on a modification to Lucene's highlighter. Currently all terms of
> a phrase query are highlighted, even if they appear out of phrase context:
> Searching for "Foo Bar" in "Foo Bar some stuff Foo" will result in
> "_Foo_ _Bar_ some stuff _Foo_". It would be nicer to have
> "_Foo_ _Bar_ some stuff Foo" as the result.
>
> I already implemented this behaviour in an older version of the highlighter,
> where things were still simple. But now I see that there was a modification to
> deal with overlapping tokens. These make the whole matter much more complicated.
Due, I believe, to a discovery I made with an Analyzer that takes
advantage of this advanced Lucene functionality (multiple tokens at the same
position). I've since realized that this issue also affects query expansion
code...
> But I guess that I will try to merge my old phrase highlighter code with the
> current version of Lucene.
>
> Is anybody working on this kind of phrase highlighting?
> Would my modifications be of interest to you?
>
> Best regards,
> Guido Wegener
>


Re: highlighting phrases [ In reply to ]
> Is anybody working on this kind of phrase highlighting?
> Would my modifications be of interest to you?

I haven't started working on this yet, but my application does have this
problem and I need to come up with a fix. If you have any code to fix phrase
highlighting I'd be more than happy to help test it. Also if you need some
help merging your previous fix I could possibly help with that.

Jason

Re: highlighting phrases [ In reply to ]
Adding support for phrases could be tricky.
So far I have deliberately avoided reimplementing specialized highlighting logic for each of the different types of
queries, e.g. understanding the nuances of the "slop factor" in phrase queries. I may be wrong, but adding specialized
support for different query types just feels like the start of a slippery slope.

If people are keen to add such support though, here are some pointers to bear in mind...

Remember that the highlighter is also designed to summarize docs by selecting best fragments.
One decision to be made up front is whether a special "Fragmenter" implementation is required that uses the
query to influence the way it breaks the doc into fragments, i.e. it ensures that matching words in phrase queries
or span queries remain in the same fragment.
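
One way such a query-aware fragmenter could pick its boundaries is sketched below. This is a minimal illustration only and does not implement Lucene's Fragmenter interface. It assumes the phrase matches are already known as inclusive token-position ranges, and the class name, method name and inputs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PhraseAwareFragmentSketch {

    // Returns the start token index of each fragment. spans holds inclusive
    // {first, last} token positions of phrase matches (hypothetical input).
    static List<Integer> fragmentStarts(int tokenCount, int fragmentSize, int[][] spans) {
        if (fragmentSize <= 0) {
            throw new IllegalArgumentException("fragmentSize must be positive");
        }
        List<Integer> starts = new ArrayList<>();
        int start = 0;
        while (start < tokenCount) {
            starts.add(start);
            int next = start + fragmentSize;
            boolean moved;
            do {
                moved = false;
                for (int[] span : spans) {
                    // A boundary at 'next' would split the span if the span
                    // begins before it and ends at or after it.
                    if (span[0] < next && next <= span[1]) {
                        next = span[1] + 1; // push the boundary past the match
                        moved = true;
                    }
                }
            } while (moved); // repeat in case the new boundary splits another span
            start = next;
        }
        return starts;
    }

    public static void main(String[] args) {
        int[][] spans = { {3, 4} };
        // Prints: [0, 5, 9]
        System.out.println(fragmentStarts(10, 4, spans));
    }
}
```

With a fragment size of 4, the plain boundaries would be 0, 4 and 8. The match at positions 3 to 4 pushes the second boundary to 5, so both matching words land in the first fragment.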

If phrase matches are allowed to span fragments, thought needs to be given to how the fragments are scored.

Do phrases/spans get marked up with one tag, e.g. <B>My Phrase</B>, or many, e.g. <B>My</B> <B>Phrase</B>?
I expect "many" is the answer, given the possibility of other query terms appearing intermingled in a phrase with a
high slop factor or a span.

The position of terms in the phrases will need to be known by the Formatter implementation before attempting
to mark up the text. This could/should be done using position info in the Lucene index rather than requiring a separate
analyzer pass over the original text.
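
As a rough sketch of that idea, the code below works out which token positions belong to a phrase occurrence from per-term position lists alone. The map of positions is a hypothetical stand-in for whatever position data the index provides, the names are made up for illustration, and only exact phrases (slop 0) are handled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PhrasePositionSketch {

    // positions: term -> token positions at which the term occurs in one
    // document (hypothetical input standing in for index position data).
    // Returns the positions that are part of an exact phrase occurrence.
    static Set<Integer> phrasePositions(List<String> phraseTerms,
                                        Map<String, List<Integer>> positions) {
        Set<Integer> result = new TreeSet<>();
        if (phraseTerms.isEmpty()) {
            return result;
        }
        List<Integer> starts =
                positions.getOrDefault(phraseTerms.get(0), Collections.emptyList());
        for (int start : starts) {
            boolean match = true;
            // The k-th phrase term must occur exactly k positions after the first.
            for (int k = 1; k < phraseTerms.size(); k++) {
                List<Integer> termPositions =
                        positions.getOrDefault(phraseTerms.get(k), Collections.emptyList());
                if (!termPositions.contains(start + k)) {
                    match = false;
                    break;
                }
            }
            if (match) {
                for (int k = 0; k < phraseTerms.size(); k++) {
                    result.add(start + k);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Positions for the text "Foo Bar some stuff Foo".
        Map<String, List<Integer>> positions = new HashMap<>();
        positions.put("Foo", Arrays.asList(0, 4));
        positions.put("Bar", Arrays.asList(1));
        // Prints: [0, 1]
        System.out.println(phrasePositions(Arrays.asList("Foo", "Bar"), positions));
    }
}
```

A formatter given this set would mark up positions 0 and 1 and leave the "Foo" at position 4 alone.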

Most of this should be achievable using specialized implementations of Formatter, Fragmenter and Scorer, so the main
Highlighter code should remain untouched.

These are just some of the "gotchas" off the top of my head. I'm sure there will be several more issues waiting to be revealed...
Hope this helps anyway.
Cheers
Mark
