Mailing List Archive: Use lucene custom scorer for highlighting?

Hello Lucene Developers,
We're working on a search service that uses Lucene indexes. One of the things I'm hoping to find is the set of places where we can plug our own custom classes into the search process.
The first use case is highlighting. The legacy search engine we use collects all term positions for highlighting during the search itself, so everything happens in one pass instead of following the search-first-then-highlight model. For the way we use highlighting, this is more efficient than reprocessing the query.
One thought I had was to create a custom scorer that is called during search and gathers highlights in addition to scoring. I think this would be especially useful for proximity queries, or for any other scoring based on the positions of words in the document. Instead of advancing the term vectors and finding phrases in a document at search time, and then doing it AGAIN at highlight time, it would help if there were a way to access the position data the search process has already used.
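
To make the single-pass idea concrete, here is a minimal sketch in plain Java. It is a toy model and not Lucene code: the class `PhraseMatchSketch`, its `Result` type, and the `matchPhrase` method are hypothetical names chosen for illustration. It assumes each term's positions in one document are available as a sorted int array, and it records the start of every two-term phrase match while counting the phrase frequency a scorer would use.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene API): two-term phrase matching over sorted position lists. */
final class PhraseMatchSketch {

    /** Outcome of one pass over a document: phrase frequency plus positions to highlight. */
    static final class Result {
        final int freq;
        final List<Integer> highlightStarts;

        Result(int freq, List<Integer> highlightStarts) {
            this.freq = freq;
            this.highlightStarts = highlightStarts;
        }
    }

    /**
     * Walks both sorted position lists once. Whenever the second term sits
     * directly after the first, the match is counted and its start position
     * is recorded, so no second pass is needed for highlighting.
     */
    static Result matchPhrase(int[] firstTermPositions, int[] secondTermPositions) {
        List<Integer> starts = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < firstTermPositions.length && j < secondTermPositions.length) {
            int wanted = firstTermPositions[i] + 1; // where the second term must be
            if (secondTermPositions[j] < wanted) {
                j++; // second term is too early, advance it
            } else if (secondTermPositions[j] > wanted) {
                i++; // no second term follows this occurrence, advance the first term
            } else {
                starts.add(firstTermPositions[i]); // phrase found: keep it for highlighting
                i++;
                j++;
            }
        }
        return new Result(starts.size(), starts);
    }

    public static void main(String[] args) {
        // "quick brown fox quick brown": "quick" at 0 and 3, "brown" at 1 and 4
        Result r = matchPhrase(new int[] {0, 3}, new int[] {1, 4});
        System.out.println(r.freq + " matches at " + r.highlightStarts); // 2 matches at [0, 3]
    }
}
```

The point of the sketch is only that the positions found while computing the frequency are the same positions a highlighter would need.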

Any suggestions, comments, or references that would enlighten me would be appreciated. I've had great difficulty finding helpful documents as I get to know Lucene.

Thanks,
Chris Hahn
This e-mail is for the sole use of the intended recipient and contains information that may be privileged and/or confidential. If you are not an intended recipient, please notify the sender by return e-mail and delete this e-mail and any attachments. Certain required legal entity disclosures can be accessed on our website: https://www.thomsonreuters.com/en/resources/disclosures.html

Hi Chris,

While this is theoretically possible, this would require rewriting all
queries that you might want to run, so this would be a huge investment.

In general, doing something like that is a bad idea because it requires
computing highlights for many documents that may never make it into the
top-k hits.
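
The top-k concern can be illustrated with a small sketch in plain Java. This is a toy model and not Lucene code: the class `TopKHighlightCost` and the method `countHighlightCalls` are hypothetical names, and it assumes that gathering highlights costs one unit of work per document. It compares gathering highlights for every match during scoring with highlighting only the documents that survive in a top-k heap.

```java
import java.util.PriorityQueue;

/** Toy model (not Lucene API): how many documents get highlighted under each strategy. */
final class TopKHighlightCost {

    /**
     * Returns {eager, deferred}: the number of documents highlighted when
     * highlights are gathered during scoring versus after top-k selection.
     */
    static int[] countHighlightCalls(float[] scores, int k) {
        // Eager: a highlight-gathering scorer does the work for every matching document.
        int eager = scores.length;

        // Deferred: keep only the k best scores in a min-heap, then highlight the survivors.
        PriorityQueue<Float> topK = new PriorityQueue<>();
        for (float score : scores) {
            topK.add(score);
            if (topK.size() > k) {
                topK.poll(); // evict the current lowest score
            }
        }
        int deferred = topK.size();

        return new int[] {eager, deferred};
    }

    public static void main(String[] args) {
        float[] scores = new float[1000];
        for (int doc = 0; doc < scores.length; doc++) {
            scores[doc] = (doc * 37) % 101; // arbitrary made-up scores
        }
        int[] counts = countHighlightCalls(scores, 10);
        System.out.println("eager=" + counts[0] + " deferred=" + counts[1]); // eager=1000 deferred=10
    }
}
```

With 1,000 matching documents and k = 10, the eager strategy highlights 1,000 documents while the deferred one highlights 10, although the deferred one has to revisit positions for those 10.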

On Thu, Nov 4, 2021 at 5:44 PM Hahn, Christopher (TR Technology) <
christopher.hahn@thomsonreuters.com> wrote:

> [...]

--
Adrien