Mailing List Archive: Ascii folding

Ascii folding

dawid.weiss at gmail

Nov 10, 2023, 9:19 AM

Post #1 of 8 (169 views)

I just stumbled upon this stop word appearing in one of our indexes:

th?

Look closely. Can you see it? I doubt it; I couldn't either. This is the hex
dump of that token:

74 68 d0 b5

which means

th? and the

are two different things.

Here's the unicode letter after "th":
https://www.fileformat.info/info/unicode/char/0435/index.htm
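
For anyone who wants to reproduce the observation, here is a minimal Java sketch that uses only the standard library. The class name `HexDumpDemo` and the helper `hex` are made up for illustration; they are not part of Lucene.

```java
import java.nio.charset.StandardCharsets;

public class HexDumpDemo {
    // Returns the UTF-8 bytes of s as space-separated lowercase hex pairs.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String lookalike = "th\u0435"; // ends in U+0435, CYRILLIC SMALL LETTER IE
        String plain = "the";          // ends in U+0065, LATIN SMALL LETTER E

        System.out.println(hex(lookalike));          // 74 68 d0 b5
        System.out.println(hex(plain));              // 74 68 65
        System.out.println(lookalike.equals(plain)); // false
    }
}
```

The two strings render almost identically in most fonts, but they differ in the last code point, so a stop word list that contains only "the" will not match the first one.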

To my surprise, I couldn't find it in the ascii folding filter:

https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java

Does anybody remember whether the omission of Cyrillic characters was
intentional? (There are quite a few of them that are nearly identical in
appearance to Latin letters.)

Dawid

Re: Ascii folding

sarowe at gmail

Nov 10, 2023, 9:56 AM

Post #2 of 8 (169 views)

Hi Dawid,

When I contributed to this class, I thought it was about the “looks like” relation (between source and target chars), so it would make sense to me to add Cyrillic.[1]

However, if you look at the other comments in that issue[1], you can see that there are conflicting language-specific issues that can arise, mostly(?) about “sounds like” or existing-language-specific-ascii-substitution relations, rather than simply “looks like”.

So IIRC, I excluded language-specific code blocks to avoid controversy like the above; only phonetic blocks and Latin-specific blocks were included[2].

Steve

[1] https://issues.apache.org/jira/browse/LUCENE-1390?focusedCommentId=12635607&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-12635607
[2] https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html

Re: Ascii folding

hossman_lucene at fucit

Nov 10, 2023, 9:57 AM

Post #3 of 8 (169 views)

: Here's the unicode letter after "th":
: https://www.fileformat.info/info/unicode/char/0435/index.htm
:
: To my surprise, I couldn't find it in the ascii folding filter:
:
: https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java
:
: Does anybody remember whether the omission of Cyrillic characters was
: intentional? (There are quite a few of them that are nearly identical in
: appearance to Latin letters.)

From the javadocs, I'm going to guess it's because the filter focuses
on "Latin_characters_in_Unicode" ... and your "CYRILLIC SMALL LETTER IE"
isn't described as being a "(adjective) LATIN noun (WITH noun)" like all
of the other characters that are considered to have a direct mapping to
the "ASCII" / Latin characters.
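
To make the naming point concrete, here is a small Java sketch that looks up Unicode character names with the standard `Character.getName`. The class `CharNameDemo` and the helper `isNamedLatin` are hypothetical, and the name check is only an illustration of the observation above; it is an assumption for demonstration purposes, not a description of how the filter decides which characters to fold.

```java
public class CharNameDemo {
    // True if the code point's Unicode character name contains the word LATIN.
    // Illustrative heuristic only; not taken from ASCIIFoldingFilter.
    static boolean isNamedLatin(int codePoint) {
        String name = Character.getName(codePoint);
        return name != null && name.contains("LATIN");
    }

    public static void main(String[] args) {
        int[] samples = {0x00E9, 0x0435};
        for (int cp : samples) {
            System.out.printf("U+%04X %s latin=%b%n",
                    cp, Character.getName(cp), isNamedLatin(cp));
        }
        // U+00E9 LATIN SMALL LETTER E WITH ACUTE latin=true
        // U+0435 CYRILLIC SMALL LETTER IE latin=false
    }
}
```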

If you look back at when it was added...

https://issues.apache.org/jira/browse/LUCENE-1390

...the original focus was on deprecating "ISOLatin1AccentFilter" and
replacing it with "a more comprehensive version of this code that included
not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 and Latin
Extended A unicode blocks." (The originally proposed name was
'ISOLatinAccentFilter') ... subsequent discussion focused on adding more
Latin blocks.

There was a related issue at the time which initially aimed to add a
more general "UnicodeNormalizationFilter" that ultimately resulted in
adding the "ICU" analysis classes...

https://issues.apache.org/jira/browse/LUCENE-1343

...which IIUC may better handle "CYRILLIC SMALL LETTER IE" (but I haven't
tested that).

-Hoss
http://www.lucidworks.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Ascii folding

dawid.weiss at gmail

Nov 10, 2023, 10:03 AM

Post #4 of 8 (169 views)

Hi Steve, Chris,

Ok, makes sense. Thanks for the pointers. I agree the justification for the
use of character-level normalization filters is highly context-dependent
(for example, unsuitable when mixed languages are present on input).

Dawid

Re: Ascii folding

rcmuir at gmail

Nov 10, 2023, 10:13 AM

Post #5 of 8 (169 views)

For visually confusing characters we have the option to expose specific
processing for that, e.g.
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/SpoofChecker.html#getSkeleton-java.lang.CharSequence-

Maybe there are use-cases for a search engine, e.g. find me documents
with words that "could be confused visually" with 'beer' (or whatever
the query is). Usually this processing is geared around security
use-cases.

Re: Ascii folding

rcmuir at gmail

Nov 10, 2023, 10:22 AM

Post #6 of 8 (167 views)

Sorry, I meant to provide the demo link too, in case you want to play:
https://util.unicode.org/UnicodeJsps/confusables.jsp?a=paypal&r=None

It illustrates how the problem of "visually confusing" is really its
own beast, e.g. confusion of 'L' vs '1' with some fonts.

Re: Ascii folding

uwe at thetaphi

Nov 11, 2023, 5:02 AM

Post #7 of 8 (165 views)

Hi Dawid,

The ASCII folding filter is meant to remove accents. What you are asking
for is matching of visually similar characters. These are two different
things.

Robert already pointed out some options. What I generally use for
Western European searches, where some documents may contain names of
people (author names, titles in Cyrillic or other languages), is to
convert the tokens using ICU transliteration (use one of the ICU folding
filters with the config below):

Transliterator.getInstance("Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFKC; CaseFold", Transliterator.FORWARD);

This converts everything to Latin characters in a language-neutral
way and then removes all accents using the trick "decompose, remove
non-spacing marks, compose again and case-fold the result".
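
The accent-removal half of that rule chain can be sketched with the JDK alone, without ICU. This is only an approximation under stated assumptions: the class `AccentStripDemo` and method `stripAccents` are hypothetical, `toLowerCase(Locale.ROOT)` is a rough stand-in for ICU's CaseFold, and the Any-Latin step is deliberately left out because the JDK has no transliterator.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentStripDemo {
    // Approximates "NFD; [:Nonspacing Mark:] Remove; NFKC; CaseFold" with JDK classes.
    static String stripAccents(String s) {
        // 1. Decompose: e-acute becomes 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // 2. Remove all non-spacing marks (Unicode general category Mn).
        String noMarks = decomposed.replaceAll("\\p{Mn}", "");
        // 3. Compose again (compatibility composition).
        String composed = Normalizer.normalize(noMarks, Normalizer.Form.NFKC);
        // 4. Lower-casing as a rough substitute for full case folding.
        return composed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Crème Brûlée", written with escapes to avoid source-encoding issues.
        System.out.println(stripAccents("Cr\u00e8me Br\u00fbl\u00e9e")); // creme brulee
        // Without a transliteration step, Cyrillic U+0435 is left untouched.
        System.out.println(stripAccents("th\u0435").equals("the"));      // false
    }
}
```

The second line of output shows why the Any-Latin step matters for the original problem: decomposition and mark removal alone do not change a Cyrillic letter, so the look-alike token would still differ from "the".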

Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:uwe@thetaphi.de

Re: Ascii folding

dawid.weiss at gmail

Nov 12, 2023, 9:12 AM

Post #8 of 8 (165 views)

Thanks Robert, Uwe - all this is enlightening. I didn't know about those
things you mentioned.

Dawid
