Mailing List Archive

Backporting of Nori
Hello devs at Lucene, I'm Roy.
We currently use both the Java and dotnet implementations of Lucene.
We've greatly enjoyed your work in our efforts to expand our app to accommodate and analyze multiple languages.
However, I've encountered a roadblock while working with the Korean analyzer.
There is an adequate implementation of the Korean (Nori) analyzer in Java, but not for dotnet.
After communicating and working with the developers at Lucene dotnet, it appears that some changes to the dependencies have made debugging almost impossible with the oldest versions of the Korean analyzer.
The developers at Lucene dotnet are requesting a backport of the Korean analyzer to Lucene 4.8.0 (Java) to serve as a basis for porting the Java implementation to dotnet.
If this version could be provided in a dedicated branch, it would be greatly appreciated so that this work can move forward, as many other developers are also waiting for this feature to be made available in dotnet.

Thank you for your contributions!
Best regards,
Roy Hwang
RE: Backporting of Nori
Hello,

To clarify, this is regarding the GitHub PR and Lucene JIRA ticket where you can read more info:

https://github.com/apache/lucenenet/pull/645
https://issues.apache.org/jira/browse/LUCENE-8231

Three years ago we attempted to port the Nori analysis package from Lucene 8.2.0 to Lucene.NET 4.8.0 and got everything working except for 6 tests. Three of them deal with the KoreanNumberFilter, and we are confident we can fix those without your assistance.

However, there are 3 tests that fail due to changes to the FST implementation between Lucene 4.8.0 and 8.2.0:


1. TestKoreanTokenizer.testRandomHugeStringsMockGraphAfter()
2. TestKoreanTokenizerFactory.testUserDict()
3. UserDictionaryTest.TstLookup()

We attempted:


1. Rebuilding the dictionaries with the Lucene.NET 4.8.0 FST from mecab-ko-dic-2.0.3-20170922.tar.gz <https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.3-20170922.tar.gz>.
2. Porting the Lucene 8.2.0 FST over to the Lucene.Net.Analysis.Nori project and rewiring Nori to be the only project that uses it.

Unfortunately, the former doesn't change the test results, and the latter simply fails the version check because we are below Lucene 6. Removing the check still produces invalid FST output, but it is clear that it was never meant to be compatible.

Our most recent attempt is at: https://github.com/NightOwl888/lucenenet/tree/feature/analysis-nori-2

It is far simpler to backport the Nori package to Lucene.NET 4.8.0 than it is to upgrade the entire project to at least 7.4.0. So, we could use your assistance to help us convert the Nori package to be compatible with the FST in Lucene 4.8.0.

Some options (there may be more):


1. Backport the analyzers-nori package from either the latest Lucene version or 8.2.0 to Lucene 4.8.0 (in Java). Once it is functional, we can use it as a basis to both port and compare execution to find any bugs on our end.
2. Provide us with the high-level info on how the FST package has changed between 4.8.0 and 8.2.0 (or the latest version) so we can make the backport. We need some sort of map to follow to understand the changes at the binary level.

Note that we maintain a copy of Lucene 4.8.0 for debugging purposes because the Maven artifacts are now stale and we had to upgrade them to get the build to work: https://github.com/NightOwl888/lucene/tree/releases/lucene-solr/4.8.0/updated

The first option would work best for us, primarily because the FST is a bit of a puzzle that we haven't dealt with at a high level, but we are willing to learn if you can point the way.

Thanks,
Shad Storhaug (NightOwl888)
Project Chairperson - Apache Lucene.NET

From: Roy Hwang <r.hwang@criteo.com.INVALID>
Sent: Saturday, October 15, 2022 12:41 AM
To: dev@lucene.apache.org
Subject: Backporting of Nori

Re: Backporting of Nori
Hello Roy and Shad,

What you are asking is not straightforward. I worry it would take me a lot
of time, and I'm not even sure I would succeed. I assume that
other committers who read your email felt the same way. My preferred path
forward would be to delay support for Korean in Lucene.NET until you can
upgrade to a more recent version of Apache Lucene.

On Tue, Oct 18, 2022 at 9:31 PM Shad Storhaug <shad@shadstorhaug.com> wrote:



--
Adrien