Mailing List Archive

Text search in Arabic
Hello Lucene Community,

I hope this finds you all well. Is this the right medium to discuss text search in relation to variant Unicode encodings of words in Arabic and other Arabic-script languages? This is not a great example, but the issues are similar to those in Latin-script searches, where the letter “?” needs to be substituted with “I” in searches, and so forth. Would this mailing list be the best medium to discuss such matters? If not, would you mind recommending a medium for this discussion?

Kind regards,
Mete Kural
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Text search in Arabic
Hi Mete

You might also want to try the java-user@lucene.apache.org mailing list

https://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg

Regarding languages other than English, you might find more information at

https://cwiki.apache.org/confluence/display/lucene/LuceneFAQ#LuceneFAQ-CanIuseLucenetoindextextinChinese,Japanese,Korean,andothermulti-bytecharactersets?

although I just realized that the following link no longer works

https://lucene.apache.org/core/lucene-sandbox/

Are these analyzers now inside

https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis
https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar

?

Thanks

Michael


On 20.05.21 at 14:48, Mete Kural wrote:
Re: Text search in Arabic
Hello Michael,

Thank you very much for this information.

I will try at java-user@lucene.apache.org also.

By the way, is the Arabic analyzer referenced here (https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar) just for the Arabic language or all languages written with the Arabic script?

Thank you,
Mete


> On May 20, 2021, at 4:35 PM, Michael Wechner <michael.wechner@wyona.com> wrote:
Re: Text search in Arabic
Hi,

In answer to your question about character substitutions: the ICU library does this with ICU transforms. It can also, for example, convert all Cyrillic text to Latin during indexing and search. This greatly helps people find things.

A great example of such a transform is in Elasticsearch's documentation. I regularly use it when the language of the text is unknown and the text can only be tokenized: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-transform.html

The example there transliterates any text to Latin characters, then decomposes umlauts and accents, strips the accents after the decomposition, and composes the remaining characters again. After that, you have tokens in mostly Latin script without any accents.

You can also use this in Solr or pure Lucene (ICUTransformFilter).
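
As a rough illustration of the last three steps described above (decompose, strip accents, recompose), here is a minimal sketch that uses only the JDK's java.text.Normalizer. It does not perform the script-to-Latin transliteration, which needs ICU itself, and it is not the ICU or Lucene implementation. The class and method names (AccentFolding, fold) are made up for this example.

```java
import java.text.Normalizer;

// JDK-only sketch of "decompose, strip nonspacing marks, recompose".
// The transliteration-to-Latin step requires ICU and is left out here.
class AccentFolding {

    // Hypothetical helper; the name is an assumption, not a Lucene or ICU API.
    static String fold(String input) {
        // 1. Canonical decomposition: u-umlaut becomes "u" + U+0308.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Remove nonspacing marks (Unicode general category Mn).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // 3. Recompose whatever is left.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Latin text with an umlaut and an acute accent.
        System.out.println(fold("M\u00FCller caf\u00E9"));
        // Arabic short-vowel marks (here fatha, U+064E) are nonspacing marks
        // too, so a vocalized word folds to its bare consonant skeleton.
        System.out.println(fold("\u0643\u064E\u062A\u064E\u0628\u064E"));
    }
}
```

Note that stripping every nonspacing mark also removes Arabic vowel marks. That is often desirable for search, but it is a design choice that should be checked against the languages being indexed.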

Uwe

On May 20, 2021 1:35:45 PM UTC, Michael Wechner <michael.wechner@wyona.com> wrote:

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
Re: Text search in Arabic
This is only for the Arabic language.

If you don't know the language and just want to assist people searching with different scripts (for example, searching with Latin letters for Arabic text), see my other answer.

Uwe

On May 20, 2021 2:38:26 PM UTC, Mete Kural <metekural@icloud.com.INVALID> wrote:

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
Re: Text search in Arabic
I recommend normalizing all characters with a compatibility transformation, whether they are Arabic or not.

We use this charFilter as the first step in every query and indexing analysis chain.

<charFilter class="solr.ICUNormalizer2CharFilterFactory"/>

You’ll also need to include the ICU library, which should be included by default. Actually, the compatibility normalization should be done by default, too. That transform was designed specifically for string matching and search.

We have this in every solrconfig.xml.

<!-- extras for ICU-based Unicode normalization -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
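
To see what a compatibility normalization does to Arabic presentation forms and similar variants, here is a small JDK-only sketch. It assumes that NFKC followed by lowercasing is a close enough stand-in for the NFKC_Casefold mode that the ICU normalizer char filter is understood to apply when no attributes are given. The two are not identical, and the class and method names are made up for this example.

```java
import java.text.Normalizer;
import java.util.Locale;

// JDK-only sketch of compatibility normalization for search.
// Assumption: NFKC + lowercase approximates ICU's NFKC_Casefold.
class CompatNormalization {

    // Hypothetical helper; not part of Solr, Lucene, or ICU.
    static String normalizeForSearch(String input) {
        String nfkc = Normalizer.normalize(input, Normalizer.Form.NFKC);
        return nfkc.toLowerCase(Locale.ROOT);
    }

    // Prints the code points of a string, since Arabic output can be
    // hard to compare by eye.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+FEFB is the presentation-form ligature LAM WITH ALEF (isolated).
        // NFKC maps it to the two base letters U+0644 U+0627.
        System.out.println(codePoints(normalizeForSearch("\uFEFB")));
        // U+FB01 is the Latin "fi" ligature; it becomes plain "fi".
        System.out.println(normalizeForSearch("\uFB01le"));
        // Fullwidth Latin letters fold to ASCII, then to lowercase.
        System.out.println(normalizeForSearch("\uFF21BC"));
    }
}
```

Applying the same function to both indexed text and queries is what makes the variant encodings match, which is why the char filter above goes into both analysis chains.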

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On May 20, 2021, at 9:38 AM, Mete Kural <metekural@icloud.com.INVALID> wrote:
Re: Text search in Arabic
Thank you for all this information, Uwe and Walter!

Let me digest this information, educate myself on these matters, and figure out a way forward.

Have a great one,
Mete


> On May 20, 2021, at 6:43 PM, Walter Underwood <wunder@wunderwood.org> wrote: