Mailing List Archive

Integrating NLP into Lucene Analysis Chain
Greetings,
I would greatly appreciate anyone sharing their experience doing NLP/lemmatization, and I am also very curious to gauge the opinion of the lucene community regarding open-nlp. I know there are a few other libraries out there, some of which can’t be directly included in the lucene project because of licensing issues. If anyone has any suggestions/experiences, please do share them :-)
As a side note, I’ll add that I’ve been experimenting with open-nlp’s PoS/lemmatization capabilities via lucene’s integration. During the process I uncovered some issues which made me question whether open-nlp is the right tool for the job. The first issue was a “low-hanging bug”, which would most likely have been addressed sooner if this solution were popular; the simple bug was at least 5 years old -> https://github.com/apache/lucene/issues/11771

The second issue has more to do with the open-nlp library itself. It is not thread-safe in some very unexpected ways. Looking at the library internals reveals unsynchronized lazy initialization of shared components. Unfortunately, the lucene integration kind of sweeps this under the rug by wrapping everything in a pretty big synchronized block; here is an example: https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36 . This is problematic because these functions run in really tight loops and probably shouldn’t be blocking. Even if one did decide to do blocking initialization, it could still be done at a much lower level than it currently is. From what I gather, the functions that are synchronized at the lucene level could be made thread-safe in a much more performant way if they were fixed in open-nlp. But I am also starting to doubt whether this is worth pursuing, since I don't know whether anyone would find it useful, hence the original inquiry.
I’ll add that I have separately used the open-nlp sentence break iterator (which suffers from the same problem: https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39 ) at production scale and discovered really bad performance under certain conditions, which I attribute to this unnecessary synchronization. I suspect this may have impacted others as well: https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
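To make the lazy-initialization hazard concrete, here is a minimal, hypothetical Java sketch (not OpenNLP's actual code): an unguarded null-check getter is racy, while the initialization-on-demand holder idiom confines the cost to initialization and leaves the read path lock-free, instead of synchronizing every call in the hot loop:

```java
// Hypothetical sketch, not OpenNLP's actual code.

class ExpensiveModel {
    final long loadedAt = System.nanoTime(); // stand-in for loading a model file
}

// Unsynchronized lazy initialization: two threads can both observe null
// and build two models, or one thread can see a half-constructed instance.
class UnsafeLazy {
    private ExpensiveModel model; // neither volatile nor guarded by a lock

    ExpensiveModel get() {
        if (model == null) {              // check ...
            model = new ExpensiveModel(); // ... then act: a classic race
        }
        return model;
    }
}

// Initialization-on-demand holder idiom: the JVM runs the static
// initializer exactly once, so reads afterwards need no lock at all.
class SafeLazy {
    private static class Holder {
        static final ExpensiveModel INSTANCE = new ExpensiveModel();
    }

    static ExpensiveModel get() {
        return Holder.INSTANCE; // lock-free read path
    }
}
```

This is only a sketch of the general technique; the actual shape of a fix would depend on which OpenNLP members are lazily initialized.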
Many thanks,
Luke Kot-Zaniewski
Re: Integrating NLP into Lucene Analysis Chain [ In reply to ]
Hi,

Is this 'synchronized' really needed?

1. Lucene tokenstreams are only used by a single thread. If you index
with 10 threads, 10 tokenstreams are used.
2. These OpenNLP factories make a new *Op for each tokenstream that
they create, so there's no thread hazard.
3. If I remove the 'synchronized' keyword everywhere from the opennlp module
(NLPChunkerOp, NLPNERTaggerOp, NLPPOSTaggerOp, NLPSentenceDetectorOp,
NLPTokenizerOp), then all the tests pass.
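The ownership model in points 1 and 2 can be sketched like this (class names are illustrative, not the real Lucene/OpenNLP types): the factory hands every tokenstream its own Op instance, so per-instance mutable state is never touched by two threads:

```java
// Illustrative names only -- not the real Lucene/OpenNLP classes.

class PosTaggerOp {
    private final String[] scratch = new String[1]; // per-instance mutable state

    String tag(String token) {
        // toy "tagging" to show that the Op carries mutable working state
        scratch[0] = token.endsWith("ing") ? "VBG" : "NN";
        return scratch[0];
    }
}

class PosTaggerFactory {
    // One fresh Op per tokenstream: since Lucene drives each tokenstream
    // from a single thread, no Op instance is ever shared across threads.
    PosTaggerOp create() {
        return new PosTaggerOp();
    }
}
```

Under that confinement model, per-call locking inside the Op buys nothing: the mutable state is only ever reachable from one thread.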

On Sat, Nov 19, 2022 at 10:26 PM Luke Kot-Zaniewski (BLOOMBERG/ 919
3RD A) <lkotzaniewsk@bloomberg.net> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Integrating NLP into Lucene Analysis Chain [ In reply to ]
https://github.com/apache/lucene/pull/11955

On Sat, Nov 19, 2022 at 10:43 PM Robert Muir <rcmuir@gmail.com> wrote:

Re: Integrating NLP into Lucene Analysis Chain [ In reply to ]
Hi Luke,

Thank you for your work and for sharing this information. From my point of view,
lemmatization is just one use case of text token annotation. I have been
working with Lucene since 2006 to index lexicographic and linguistic
data, and I have always missed the fact that (1) token attributes are not
searchable and (2) it is not straightforward to get all text tokens
indexed at the same position (synonyms) directly from a span query
(ideas and suggestions are welcome!). I think that the NLP community
might be grateful if Lucene could offer a simple way to search on token
annotations (attributes). The MTAS project achieves that
(https://github.com/textexploration/mtas), based on Lucene, and supports
the CQL query language
(https://meertensinstituut.github.io/mtas/search_cql.html). MTAS is an
inspiring project I came across recently, and you might get inspiration
from it too. But I am currently hesitant to use it because I have no
guarantee that the authors will port their code to support new Lucene
versions. I might come up with my own solution, but without (2) I can't
yet see how I could achieve it simply without redoing the same thing
that MTAS did!

Thank you.

Benoit

On 2022-11-19 at 22:26, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) wrote:
Re: Integrating NLP into Lucene Analysis Chain [ In reply to ]
Hello, Benoit.

I just came across
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TypeAsSynonymFilterFactory.html

It sounds similar to what you're asking for, but it watches the TypeAttribute only.
Also, spans have been superseded by intervals
https://lucene.apache.org/core/8_2_0/queries/org/apache/lucene/queries/intervals/IntervalQuery.html
It's better to use them instead.

On Mon, Nov 21, 2022 at 9:20 PM Benoit Mercier <benoit.mercibe@gmail.com>
wrote:



--
Sincerely yours
Mikhail Khludnev
RE: Integrating NLP into Lucene Analysis Chain [ In reply to ]
Hi Luke,

For what you've described as a "bug" in NLPPOSTaggerOp, I do agree with you that there could be a more elegant solution than simply synchronizing the entire method. That being said, IMHO, I don't see a thread-safety issue. Lucene TokenFilters are not supposed to be shared among threads. They can be re-used across threads, though.

NLP operations, stemming for example, are slow, on the other hand. If you have to put NLP processing inside the analysis chain, you may have to give up certain NLP capabilities...

My 2cents,

Guan

-----Original Message-----
From: Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <lkotzaniewsk@bloomberg.net>
Sent: Saturday, November 19, 2022 10:27 PM
To: java-user@lucene.apache.org
Subject: Integrating NLP into Lucene Analysis Chain

RE: RE: Integrating NLP into Lucene Analysis Chain [ In reply to ]
Hi Guan,

I think I've confused everyone a little bit, including myself. When I
initially went down the rabbit hole of understanding the synchronization of
these wrapping methods, I kept an eye out for all potential thread-safety
issues within open-nlp. I ended up finding issues unrelated to the
synchronized methods at hand. Most notably, open-nlp does unsafe member
initialization in a couple of places within shared factories, such as
POSTaggerFactory, which I described in more detail in the linked PR. These
unsafe methods actually get called in parallel from lucene's
FilterFactory::create. I've simply short-circuited these factories in my
application, and I am still deciding what to do long term.
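A minimal sketch of that kind of hazard (hypothetical code, not OpenNLP's actual POSTaggerFactory): a factory shared by all analysis threads that lazily fills a plain field can race when create() runs in parallel, whereas doing the work in the constructor makes the field final and safely published before the factory is shared:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not OpenNLP's actual POSTaggerFactory.

// Shared factory with an unguarded lazy member: two parallel callers can
// both see null, or one can observe a partially populated map.
class SharedFactoryUnsafe {
    private Map<String, String> tagDict; // lazily built, no synchronization

    String lookup(String word) {
        if (tagDict == null) {
            Map<String, String> m = new HashMap<>();
            m.put("dogs", "dog"); // stand-in for loading a dictionary
            tagDict = m;
        }
        return tagDict.get(word);
    }
}

// Building the map in the constructor makes the field final: the Java
// memory model guarantees it is safely published to every thread that
// later obtains a reference to the factory.
class SharedFactorySafe {
    private final Map<String, String> tagDict;

    SharedFactorySafe() {
        Map<String, String> m = new HashMap<>();
        m.put("dogs", "dog");
        tagDict = Collections.unmodifiableMap(m);
    }

    String lookup(String word) {
        return tagDict.get(word); // read-only after construction
    }
}
```

Eager construction trades a little startup cost for a read path that needs no locking at all, which fits the tight analysis loops discussed earlier.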

Luke

On 2022/11/21 20:12:34 "Wang, Guan" wrote:
RE: Re: Integrating NLP into Lucene Analysis Chain [ In reply to ]
Hi Benoit,

Thanks for the reply and link! My application is English-focused, so I have
the benefit of a language with little inflection. This, along with a few
other reasons, pushed me towards an index-heavy approach which avoids the
complexities involved with synonyms of different position lengths
(i.e. where you would need SynonymGraphFilter), and it also simplifies query
composition. Having said that, I found creating a custom filter that packs
equal-length synonym tokens into the same position to be relatively simple.
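A toy model of that stacking approach (illustrative, not Lucene's actual TokenStream API): a synonym token is emitted with a position increment of 0, so it lands on the same position as the token before it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model, not Lucene's TokenStream API: a token carries a position
// increment; 1 advances to the next position, 0 stacks the token onto
// the same position as the previous one (how synonyms are laid out).
class Tok {
    final String term;
    final int posIncr;
    Tok(String term, int posIncr) { this.term = term; this.posIncr = posIncr; }
}

class Positions {
    // Resolve absolute positions from increments, grouping stacked tokens.
    static Map<Integer, List<String>> resolve(List<Tok> tokens) {
        Map<Integer, List<String>> byPos = new TreeMap<>();
        int pos = -1; // before the first token
        for (Tok t : tokens) {
            pos += t.posIncr;
            byPos.computeIfAbsent(pos, k -> new ArrayList<>()).add(t.term);
        }
        return byPos;
    }
}
```

In an actual Lucene filter this corresponds, as far as I understand it, to setting the injected token's PositionIncrementAttribute to 0; with equal-length synonyms only, no graph handling is needed.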

Luke

On 2022/11/21 18:19:56 Benoit Mercier wrote: