Hi Dawid,
the ASCII folding filter is meant to remove accents. What you would like
is to match visually similar characters when searching. These are two
different things.
Actually, Robert also has some config options. What I generally use for
Western European searches, where some documents may contain names of
people (author names, titles in Cyrillic or other languages), is to
convert the tokens using ICU transliteration (use one of the ICU analysis
filters, such as ICUTransformFilter, with the below config):
Transliterator.getInstance(
    "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFKC; CaseFold",
    Transliterator.FORWARD);
This converts everything to Latin characters in a language-neutral
way and then removes all accents using the trick "decompose, remove
non-spacing marks, compose again", and finally case-folds the result.
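
To make the "decompose, remove, compose" trick concrete, here is a
minimal sketch that uses only the JDK (java.text.Normalizer) instead of
ICU. It is not the ICU transliterator above: there is no Any-Latin step,
so Cyrillic input stays Cyrillic, and toLowerCase(Locale.ROOT) is only an
approximation of full Unicode case folding. The class name
AccentFoldSketch and the method fold are made up for illustration.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, JDK only. It mimics the
// "NFD; [:Nonspacing Mark:] Remove; NFKC; CaseFold" part of the rule,
// but NOT the "Any-Latin" part (that needs ICU4J).
final class AccentFoldSketch {

    static String fold(String input) {
        // 1. Decompose: an accented letter becomes base letter + combining mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Remove non-spacing marks (Unicode general category Mn).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // 3. Compose again, using compatibility composition.
        String composed = Normalizer.normalize(stripped, Normalizer.Form.NFKC);
        // 4. Assumption: lower-casing stands in for real case folding here.
        return composed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Input is "Simunek Cafe" with a caron, a ring and an acute accent,
        // written with escapes so the source file encoding does not matter.
        System.out.println(fold("\u0160im\u016Fnek Caf\u00E9"));
    }
}
```

With this sketch, the accented Latin name folds to plain lower-case
letters, while CYRILLIC SMALL LETTER IE (U+0435) comes out unchanged.
Mapping that one to a Latin "e" is what the Any-Latin step of the ICU
rule is for.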
Uwe
On 10.11.2023 at 19:03, Dawid Weiss wrote:
>
> Hi Steve, Chris,
>
> Ok, makes sense. Thanks for the pointers. I agree the justification
> for the use of character-level normalization filters is highly
> context-dependent (for example, unsuitable when mixed languages are
> present on input).
>
> Dawid
>
> On Fri, Nov 10, 2023 at 6:58 PM Chris Hostetter
> <hossman_lucene@fucit.org> wrote:
>
>
> : Here's the unicode letter after "th":
> : https://www.fileformat.info/info/unicode/char/0435/index.htm
> :
> : To my surprise, I couldn't find it in the ascii folding filter:
> :
> :
> : https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
> :
> : Does anybody remember whether the omission of Cyrillic characters was
> : intentional (there are quite a few of them that are nearly identical
> : in appearance to Latin letters)?
>
> From the javadocs, i'm going to guess it's because the filter focuses
> on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
> isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
> of the other characters that are considered to have a direct mapping to
> the "ASCII" / latin characters.
>
> If you look back at when it was added...
>
> https://issues.apache.org/jira/browse/LUCENE-1390
>
> ...the original focus was on deprecating "ISOLatin1AccentFilter" and
> replacing it with "a more comprehensive version of this code that
> included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and
> Latin Extended A unicode blocks." (The originally proposed name was
> 'ISOLatinAccentFilter') ... subsequent discussion focused on adding
> more Latin blocks.
>
> There was a related issue at the time which initially aimed to add a
> more general "UnicodeNormalizationFilter" that ultimately resulted in
> adding the "ICU" analysis classes...
>
> https://issues.apache.org/jira/browse/LUCENE-1343
>
> ...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but i
> haven't tested that)
>
>
>
> -Hoss
> http://www.lucidworks.com/
>
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de eMail:uwe@thetaphi.de