Hi,
Let me briefly explain the background and intent of the change.
Basically, character normalization is not a responsibility of a
tokenizer and should not be performed when you "tokenize" texts.
Instead, there are charFilters and tokenFilters that perform
full-width and half-width normalization.
You can use either one for that purpose:
- CJKWidthCharFilter
  (https://lucene.apache.org/core/9_0_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthCharFilter.html)
- CJKWidthFilter
  (https://lucene.apache.org/core/9_0_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html)
Also, there are more general filters that perform Unicode normalization:
- ICUNormalizer2CharFilter
  (https://lucene.apache.org/core/9_0_0/analysis/icu/org/apache/lucene/analysis/icu/ICUNormalizer2CharFilter.html)
- ICUNormalizer2Filter
  (https://lucene.apache.org/core/9_0_0/analysis/icu/org/apache/lucene/analysis/icu/ICUNormalizer2Filter.html)
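
To make "full-width and half-width normalization" concrete, here is a
minimal sketch using only the JDK. It is not how the Lucene filters above
are implemented: java.text.Normalizer with NFKC is used as a stand-in, and
NFKC applies more compatibility mappings than width folding alone. The
class and method names are made up for illustration.

```java
import java.text.Normalizer;

class WidthNormalizationSketch {
    // Stand-in for width folding (an assumption, not the Lucene code path):
    // NFKC maps full-width ASCII to half-width and half-width katakana to
    // full-width, along with other compatibility mappings.
    static String foldWidth(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String fullWidthLatin = "\uFF2A\uFF21"; // full-width letters J and A
        String halfWidthKana = "\uFF76\uFF85";  // half-width katakana KA and NA

        // Both comparisons use plain String equality, i.e. exact code points.
        System.out.println(foldWidth(fullWidthLatin).equals("JA"));
        System.out.println(foldWidth(halfWidthKana).equals("\u30AB\u30CA"));
    }
}
```

The point is only that the two widths are different code points until
something folds them, which is the job of the filters listed above.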
As a general suggestion, I'd recommend the charFilters here: in most
use cases, character normalization should be done before applying
dictionary-based tokenizers such as JapaneseTokenizer.
JapaneseAnalyzer has included CJKWidthCharFilter since Lucene 9.0,
so you don't need to worry about full-width and half-width
normalization of the input text if you use it.
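
The user dictionary file itself is still compared as-is (see my earlier
reply quoted below), so each entry has to be consistent on its own. As a
rough sketch, a pre-flight check along these lines could flag mixed-width
entries before loading. It assumes the entry layout
surface,segmentation,readings,part-of-speech with space-separated
segments and no quoted commas. The class, the method names and the sample
entry are hypothetical, and NFKC is again only a JDK stand-in for width
folding, not something Lucene applies to the dictionary.

```java
import java.text.Normalizer;

class UserDictionaryEntryCheck {
    // True when column 1 (surface) equals the segments of column 2 joined
    // without spaces, compared code point by code point.
    static boolean segmentationMatchesSurface(String line) {
        String[] columns = line.split(",");
        if (columns.length < 2) {
            return false;
        }
        String surface = columns[0];
        String concatenated = columns[1].replace(" ", "");
        return surface.equals(concatenated);
    }

    // One possible way to make an entry consistent: fold widths across the
    // whole line so both columns end up in the same form.
    static String normalizeWidth(String line) {
        return Normalizer.normalize(line, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Hypothetical entry: full-width "JR" in the surface column,
        // half-width "JR" in the segmentation column.
        String mixed =
            "\uFF2A\uFF32\u6771\u65E5\u672C,JR \u6771\u65E5\u672C,JR HIGASHINIHON,CUSTOM_NOUN";

        System.out.println(segmentationMatchesSurface(mixed));
        System.out.println(segmentationMatchesSurface(normalizeWidth(mixed)));
    }
}
```

If the dictionary is folded like this, the input text needs the same
folding before tokenization, otherwise the entries will not match it.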
Tomoko
On Fri, Jan 14, 2022 at 11:58 Tomoko Uchida <tomoko.uchida.1111@gmail.com> wrote:
>
> Hi,
>
> > The only thing that seems to differ is that the characters are full-width
> > vs half-width, so I was wondering if this is intended behavior or a bug/too
> > restrictive
>
> This is intended behavior. The first column in the user dictionary
> must be equal to the concatenated string of the second column in terms
> of Unicode codepoints. No normalization, such as full-width and
> half-width normalization, should be applied when loading it (any
> normalization or tweak can cause runtime bugs).
>
> On Fri, Jan 14, 2022 at 5:45 Marc D'Mello <marcd2000@gmail.com> wrote:
> >
> > Hi Mike,
> >
> > Thanks for the response! I'm actually not super familiar with
> > UserDictionaries, but looking at the code, it seems like a single line in
> > the user provided user dictionary corresponds to a single entry? In that
> > case, here is the line (or entry) that does have both widths that I believe
> > is causing the problem:
> >
> > ??????,?????,?????,JA??
> >
> > I'm guessing here that the surface is ?????? and the concatenated segment is the
> > first occurrence of ?????. I'm not sure what surface or concatenated segment means
> > though, or what it would mean semantically to replace the surface with the
> > full-width version or the concatenated segment with the half-width version.
> >
> > Thanks,
> > Marc
> >
> >
> > On Thu, Jan 13, 2022 at 7:18 AM Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > > Hi Marc, I wonder if there is a workaround for this issue: e.g., could
> > > we have entries for both widths? I wonder if there is some interaction
> > > with an analysis chain that is doing half-width -> full-width
> > > conversion (or vice versa)? I think the UserDictionary has to operate
> > > on pre-analyzed tokens ... although maybe *after* char filtering,
> > > which presumably could handle width conversions. A bunch of rambling,
> > > but maybe the point is - can you share some more information -- what
> > > is the full entry in the dictionary that causes the problem?
> > >
> > > On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello <marcd2000@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I had a question about the Japanese user dictionary. We have a user
> > > > dictionary that used to work but after attempting to upgrade Lucene, it
> > > > fails with the following error:
> > > >
> > > > > Caused by: java.lang.RuntimeException: Illegal user dictionary entry ??????
> > > > > - the concatenated segmentation (?????) does not match the surface form (??????)
> > > > > at org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
> > > >
> > > > > The specific commit causing this error is here
> > > > > <https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9>.
> > > > > The only thing that seems to differ is that the characters are full-width
> > > > > vs half-width, so I was wondering if this is intended behavior or a
> > > > > bug/too restrictive. Any suggestions for fixing this would be greatly
> > > > > appreciated!
> > > > > Thanks!
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >