Mailing List Archive

Offset-Based Analysis
Hi All,

I am trying to enrich a Lucene-powered search index with data from various NLP systems distributed throughout my company. Ideally this internally derived data could be tied back to specific positions in the original text. I’ve searched around, and the closest thing I’ve found to what I am trying to do is https://jorgelbg.me/2018/03/solr-contextual-synonyms-with-payloads/ . This approach is neat, but it has a few drawbacks because of its reliance on injected delimiters and its somewhat inflexible passing of data from PayloadAttribute to CharTermAttribute.

One thing that occurred to me is to use an offset-based approach, assuming of course that the input text is already properly encoded and sanitized. I’m thinking about implementing a CharFilter that decodes a special header, which passes along an offset-sorted list of enrichment data. This metadata could be referenced during analysis via custom attributes and could ideally handle a variety of use cases with the same offset-accounting logic. Some uses that come to mind are stashing values in term/payload attributes, or even offset-based tokenization for those wishing to tokenize outside of their search engine.
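To make the offset-accounting idea a bit more concrete, here is a rough sketch in Python (names and data shapes are hypothetical, purely for illustration): it attaches entries from an offset-sorted enrichment list to tokens that carry start/end character offsets, much like a filter consuming OffsetAttribute would.

```python
from dataclasses import dataclass

@dataclass
class Enrichment:
    start: int   # character offset into the original text
    end: int
    data: dict   # e.g. values produced by an external NLP system

def enrich_tokens(tokens, enrichments):
    """Attach offset-sorted enrichments to the tokens they overlap.

    `tokens` is a list of (term, start, end) tuples, as a tokenizer exposing
    offsets would produce; `enrichments` must be sorted by start offset.
    """
    out = []
    i = 0
    for term, t_start, t_end in tokens:
        # Discard enrichments that end before this token begins; since both
        # streams are offset-sorted, they can never match again.
        while i < len(enrichments) and enrichments[i].end <= t_start:
            i += 1
        attached = [e.data for e in enrichments[i:]
                    if e.start < t_end and e.end > t_start]
        out.append((term, t_start, t_end, attached))
    return out

tokens = [("Acme", 0, 4), ("Corp", 5, 9), ("raised", 10, 16), ("prices", 17, 23)]
meta = [Enrichment(0, 9, {"entity": "ORG"})]
print(enrich_tokens(tokens, meta))
```

The same single pass could feed payload attributes, custom attributes, or drive tokenization itself.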

Does this approach even make sense, or does it have pitfalls I am failing to see? Assuming it makes sense, does a similar solution already exist? If it doesn’t exist yet, would it be of interest to the community?
Any thoughts on this would be much appreciated.

Thanks,
Luke
Re: Offset-Based Analysis
Hello Luke.

Using offsets seems doubtful to me. What comes to mind is the pre-analyzed field:
https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type.
With it, an external NLP service can provide ready-made tokens for
straightforward indexing by Solr. That external NLP system has full power to
inject or suppress synonyms depending on the context, and to supply additional
attributes in payloads (whether it's boldness, negative/positive sentiment,
etc.) for later retrieval of those payloads.
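For illustration, such a pre-analyzed field value could be generated programmatically. A minimal sketch in Python; the key names ("t"/"s"/"e"/"i"/"p") follow the v1 JSON format described in the reference guide and should be verified against your Solr version:

```python
import base64
import json

def preanalyzed_value(stored_text, tokens):
    """Build a PreAnalyzedField JSON value (format version 1).

    Each token carries its own term ("t"), start/end offsets ("s"/"e"),
    position increment ("i"), and a base64-encoded payload ("p").
    """
    return json.dumps({
        "v": "1",
        "str": stored_text,  # value Solr stores verbatim for retrieval
        "tokens": [
            {"t": term, "s": s, "e": e, "i": 1,
             "p": base64.b64encode(payload).decode("ascii")}
            for term, s, e, payload in tokens
        ],
    })

doc = preanalyzed_value("Acme Corp raised prices",
                        [("acme", 0, 4, b"ORG"), ("corp", 5, 9, b"ORG")])
print(doc)
```

The external NLP pipeline decides the tokens, offsets, and payloads; Solr just indexes them as-is.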

On Wed, Feb 22, 2023 at 6:43 AM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <
lkotzaniewsk@bloomberg.net> wrote:


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
Re: Offset-Based Analysis
Hi Mikhail,

Thanks for the quick reply and the suggestion. This is definitely good to know about. In my case, however, there are several such NLP/data extraction systems, and I am not sure they all use the same tokenization, but I will give this another look. I can see how this is a more well-defined solution to the problem I presented. I realize that with offsets you would have to make assumptions when offset boundaries fall in the middle of a token, among other such odd cases.
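For example, one conservative assumption would be to snap an enrichment span outward to the boundaries of the tokens it partially covers. A quick sketch in Python, with hypothetical names:

```python
def snap_to_token_boundaries(span, tokens):
    """Expand a (start, end) character span so it aligns with token
    boundaries: the result covers every token the span partially overlaps.

    `tokens` is a list of (term, start, end), sorted by start offset.
    """
    start, end = span
    overlapping = [(s, e) for _, s, e in tokens if s < end and e > start]
    if not overlapping:
        return span  # span falls entirely between tokens; leave it alone
    return (min(s for s, _ in overlapping), max(e for _, e in overlapping))

tokens = [("Acme", 0, 4), ("Corp", 5, 9), ("raised", 10, 16)]
print(snap_to_token_boundaries((2, 7), tokens))  # covers parts of two tokens
```

Whether widening, narrowing, or dropping such spans is right would depend on the enrichment in question.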

Thanks again,
Luke

Re: Offset-Based Analysis
One more idea: it's possible to ask Solr for its essential tokenization via
the /analysis/field API (here's a clue: https://stackoverflow.com/a/37785401),
get the token stream back in a structured response, and pass it into the NLP
pipeline for enrichment.
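For illustration, a minimal sketch of building such a request in Python (the core name and field type below are placeholders; analysis.fieldtype and analysis.fieldvalue are the field analysis handler's request parameters):

```python
from urllib.parse import urlencode

def analysis_request(base_url, field_type, text):
    """Build a request URL for Solr's /analysis/field endpoint, which
    returns the tokens (with offsets) produced at each analysis stage
    as a structured JSON response.
    """
    params = urlencode({
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
        "wt": "json",
    })
    return f"{base_url}/analysis/field?{params}"

url = analysis_request("http://localhost:8983/solr/mycore",
                       "text_general", "Acme Corp raised prices")
print(url)
# Fetching `url` requires a running Solr instance, e.g. with
# urllib.request.urlopen; the NLP pipeline then consumes the JSON tokens.
```

That way the NLP side sees exactly the tokenization Solr will index, instead of guessing at it.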


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!