Mailing List Archive

Issue with Japanese User Dictionary
Hi,

I have a question about the Japanese user dictionary. We have a user
dictionary that used to work, but after upgrading Lucene it now fails
with the following error:

Caused by: java.lang.RuntimeException: Illegal user dictionary entry ??????
- the concatenated segmentation (?????) does not match the surface form
(??????)
at
org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)

The specific commit causing this error is here
<https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9>.
The only difference between the surface form and the segmentation seems
to be that the characters are full-width vs. half-width, so I was wondering
whether this is intended behavior or a bug (an overly restrictive check).
Any suggestions for fixing this would be greatly appreciated!
Thanks!
Re: Issue with Japanese User Dictionary
Hi Marc, I wonder if there is a workaround for this issue: e.g., could
we have entries for both widths? I also wonder if there is some interaction
with an analysis chain that is doing half-width -> full-width
conversion (or vice versa). I think the UserDictionary has to operate
on pre-analyzed tokens ... although maybe *after* char filtering,
which presumably could handle width conversions. A bunch of rambling,
but maybe the point is: can you share some more information -- what
is the full entry in the dictionary that causes the problem?

On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello <marcd2000@gmail.com> wrote:

Re: Issue with Japanese User Dictionary
Hi Mike,

Thanks for the response! I'm actually not super familiar with
UserDictionary, but looking at the code, it seems like a single line in
the user-provided user dictionary corresponds to a single entry? In that
case, here is the line (or entry) containing both widths that I believe
is causing the problem:

??????,?????,?????,JA??

I'm guessing the surface here is ?????? and the concatenated segmentation
is the first occurrence of ?????. I'm not sure what surface or concatenated
segmentation mean, though, or what it would mean semantically to replace
the surface with the full-width version or the concatenated segmentation
with the half-width version.

Thanks,
Marc


On Thu, Jan 13, 2022 at 7:18 AM Michael Sokolov <msokolov@gmail.com> wrote:

Re: Issue with Japanese User Dictionary
Hi,

> The only thing that seems to differ is that the characters are full-width
> vs half-width, so I was wondering if this is intended behavior or a bug/too
> restrictive

This is intended behavior. The first column in the user dictionary
must be equal to the concatenated string of the second column in terms
of Unicode codepoints. No normalization, such as full-width/half-width
normalization, should be applied (any normalization or tweak could
cause runtime bugs).
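
As a concrete illustration, here is a minimal sketch of loading a
well-formed entry (the sample entry is the standard one from the
Kuromoji example user dictionary; the class name is just for
illustration):

import java.io.StringReader;
import org.apache.lucene.analysis.ja.dict.UserDictionary;

public class UserDictFormatDemo {
  public static void main(String[] args) throws Exception {
    // Format: surface,segmentation,readings,part-of-speech.
    // The surface form (関西国際空港) must equal the concatenation of the
    // space-separated segmentation (関西 + 国際 + 空港), codepoint for
    // codepoint -- no width normalization is applied while parsing.
    String entry =
        "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n";
    UserDictionary dict = UserDictionary.open(new StringReader(entry));
    System.out.println("Loaded user dictionary: " + dict);
  }
}

If the surface form and the concatenated segmentation differed only in
character width, UserDictionary.open() would throw the RuntimeException
quoted above.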

On Fri, Jan 14, 2022 at 5:45 AM Marc D'Mello <marcd2000@gmail.com> wrote:

Re: Issue with Japanese User Dictionary
Hi,
Let me briefly explain the background and intention of the change.
Basically, character normalization is not the responsibility of a
tokenizer and should not be performed when you "tokenize" texts.
Instead, there are charFilters and tokenFilters that perform
full-width and half-width normalization.

You can use either one for that purpose:
- CJKWidthCharFilter
(https://lucene.apache.org/core/9_0_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthCharFilter.html)
- CJKWidthFilter
(https://lucene.apache.org/core/9_0_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html)

Also, there are more general filters that perform Unicode normalization:
- ICUNormalizer2CharFilter
(https://lucene.apache.org/core/9_0_0/analysis/icu/org/apache/lucene/analysis/icu/ICUNormalizer2CharFilter.html)
- ICUNormalizer2Filter
(https://lucene.apache.org/core/9_0_0/analysis/icu/org/apache/lucene/analysis/icu/ICUNormalizer2Filter.html)

If you need a general suggestion, I'd recommend the charFilters here: in
most use cases, character normalization should be done before applying
dictionary-based tokenizers such as JapaneseTokenizer.
JapaneseAnalyzer already includes CJKWidthCharFilter since Lucene 9.0,
so you don't need to worry about full-width and half-width
normalization if you use it.
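
For example, here is a minimal sketch of a custom Analyzer that applies
width normalization before tokenization (assuming Lucene 9.x with the
analysis-common and kuromoji modules on the classpath; the class name is
just for illustration):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKWidthCharFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

public class WidthNormalizingJapaneseAnalyzer extends Analyzer {
  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Normalize full-width/half-width variants at the char filter stage,
    // so the tokenizer only ever sees one consistent width.
    return new CJKWidthCharFilter(reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // null = no user dictionary; pass your UserDictionary here instead.
    // Note that user dictionary entries are not char-filtered, so they
    // must already be written in the normalized width.
    Tokenizer tokenizer =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    return new TokenStreamComponents(tokenizer);
  }
}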

Tomoko

On Fri, Jan 14, 2022 at 11:58 AM Tomoko Uchida <tomoko.uchida.1111@gmail.com> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org