Mailing List Archive: Best practice - preparing search term for Lucene

horvoje at gmail

Sep 22, 2022, 9:37 AM

Post #1 of 8 (570 views)

Hi!

I'm using Hibernate Search / Lucene to index my entities in a Spring Boot
application.

One thing I'm not sure about is how to handle Croatian-specific letters.
The Croatian alphabet has a few additional letters: "*č* *Č* *ć* *Ć* *đ* *Đ* *š*
*Š* *ž* *Ž*".
The letters "*đ* *Đ*" are commonly replaced with "*dj* *DJ*" when no Croatian
letters are available.

In my custom Hibernate bridge there is a step that replaces all Croatian
characters with appropriate ASCII replacements, which means "*č*" becomes
"*c*", "*š*" becomes "*s*" and so on.
Later, when a user enters search text, the same process is applied so that the
query matches the values in the index.
There is one more good thing about it: some older users, who used
computers in the early days when no Croatian letters were available,
type words without Croatian letters, automatically replacing "*č*" with
"*c*", and that fits my logic for getting good search results.

For example, the title of my entity is: "*juha s češnjakom i đumbirom*".
My custom Hibernate String bridge converts it to "*juha cesnjakom dumbirom*".
Then the user enters "*juha s češnjakom*".
Before issuing a search, the same conversion is applied to the user's query, and
the text sent to Lucene is "*juha cesnjakom*".
This is how I implemented it, and it's working fine.
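
A minimal sketch of the character replacement step described above, using only the JDK. The class name `CroatianAsciiFolder` and the method `fold` are hypothetical and are not the author's actual bridge code. The mapping of "đ" to "d" follows the "dumbirom" example rather than the "dj" convention. The sketch keeps short words such as "s" and "i", which the example output drops; that removal presumably happens elsewhere (for instance in stop-word handling) and is not reproduced here.

```java
import java.util.Map;

// Hypothetical helper: folds Croatian letters to plain ASCII.
final class CroatianAsciiFolder {
    // Unicode escapes keep this source file ASCII-only.
    private static final Map<Character, String> REPLACEMENTS = Map.of(
            '\u010D', "c",  // c with caron
            '\u010C', "C",  // C with caron
            '\u0107', "c",  // c with acute
            '\u0106', "C",  // C with acute
            '\u0111', "d",  // d with stroke
            '\u0110', "D",  // D with stroke
            '\u0161', "s",  // s with caron
            '\u0160', "S",  // S with caron
            '\u017E', "z",  // z with caron
            '\u017D', "Z"); // Z with caron

    // Replaces every mapped character and leaves all others untouched.
    static String fold(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char ch : text.toCharArray()) {
            String replacement = REPLACEMENTS.get(ch);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: juha s cesnjakom i dumbirom
        System.out.println(fold("juha s \u010De\u0161njakom i \u0111umbirom"));
    }
}
```

The same `fold` call would be applied both to the text handed to the analyzer and to the user's query, so that both sides of the match go through an identical conversion.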

The other way would be to index the original text, then find the words with
Croatian characters, convert them to ASCII and add them to the original.
The title "*juha s češnjakom i đumbirom*" would become "*juha češnjakom
đumbirom cesnjakom dumbirom*".
In that case there is no need to convert users' search terms, because
both "*juha s češnjakom*" and "*juha s cesnjakom*" would return the same result.
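
The alternative could be sketched like this, again with the JDK only. `DualFormExpander`, `fold` and `expand` are hypothetical names. The sketch strips combining marks after Unicode decomposition and handles "đ" separately, because that letter has no decomposition. Unlike the example in the post, it keeps every original token, including "s" and "i".

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: appends ASCII variants of accented words to the text.
final class DualFormExpander {
    // Decomposes accented letters, drops the combining marks, then maps
    // d-with-stroke by hand because NFD leaves it unchanged.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "")
                .replace('\u0111', 'd')
                .replace('\u0110', 'D');
    }

    // Returns the original tokens followed by the folded form of every
    // token that actually changed under folding.
    static String expand(String text) {
        List<String> tokens = new ArrayList<>();
        List<String> foldedVariants = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            tokens.add(token);
            String folded = fold(token);
            if (!folded.equals(token)) {
                foldedVariants.add(folded);
            }
        }
        tokens.addAll(foldedVariants);
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(expand("juha s \u010De\u0161njakom i \u0111umbirom"));
    }
}
```

With this layout the query is passed through unchanged, at the cost of a larger index, since every accented word is indexed twice.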

My question is:
Is there any reason to switch to this alternative logic and have original
keywords indexed in parallel with those converted to ASCII?

Thanks!

BR,
Hrvoje

Re: Best practice - preparing search term for Lucene

passignat at hotmail

Sep 22, 2022, 10:52 AM

Post #2 of 8 (570 views)

Hello,

The way I did it took me some time, and I'm almost sure it's applicable to all languages.

I normalized the words, replacing letters or groups of letters with a similar one.

In French, e é è ê ai ei sound more or less the same, and for someone who makes spelling mistakes, having to use the right letters is very frustrating. So I transformed all of them into e...
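
As a rough illustration of that kind of normalization, here is a small JDK-only sketch. The class `FrenchSoundNormalizer` is hypothetical, and the rule set is limited to the groups named above; a real implementation would need many more rules and would have to accept the false matches they cause.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: collapses a few similar-sounding French spellings to "e".
final class FrenchSoundNormalizer {
    static String normalize(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        // Strip accents: e-acute, e-grave and e-circumflex all become "e".
        // Note that this also strips accents from every other letter.
        String stripped = Normalizer.normalize(lower, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Collapse the letter groups "ai" and "ei" into "e".
        return stripped.replaceAll("ai|ei", "e");
    }

    public static void main(String[] args) {
        System.out.println(normalize("\u00C9l\u00E8ve")); // prints: eleve
        System.out.println(normalize("neige"));           // prints: nege
    }
}
```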

Hope it helps

Re: Best practice - preparing search term for Lucene

horvoje at gmail

Sep 23, 2022, 2:28 AM

Post #3 of 8 (570 views)

Hi Stephane!

Actually, I have exactly that kind of conversion, but I didn't mention it as
my mail was long enough without it :)
My main concern is whether I should let Lucene index the original keywords or not.
Considering what you wrote, I guess your answer would be to store only
converted values without exotic characters.

Thanks a lot for your reply!

BR,
Hrvoje

On Thu, Sep 22, 2022 at 7:53 PM Stephane Passignat <passignat@hotmail.com>
wrote:

> Hello,
>
> The way I did it took me some time, and I'm almost sure it's applicable to
> all languages.
>
> I normalized the words, replacing letters or groups of letters with a
> similar one.
>
> In French, e é è ê ai ei sound more or less the same, and for someone who
> makes spelling mistakes, having to use the right letters is very
> frustrating. So I transformed all of them into e...
>
> Hope it helps

Re: Best practice - preparing search term for Lucene

msokolov at gmail

Sep 23, 2022, 9:01 AM

Post #4 of 8 (570 views)

I think it depends how precise you want to make the search. If you
want to enable diacritic-sensitive search in order to avoid confusions
when users actually are able to enter the diacritics, you can index
both ways (ascii-folded and not folded) and not normalize the query
terms. Or you can just fold everything and not worry about it. In
French I know there are confusable words like "cote" which has at
least a few different meanings depending on the accents. Not sure how
it is in Croatian.
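
To make the trade-off concrete, here is a toy in-memory sketch of the "index both ways, do not normalize the query" option. It is not Lucene code: `DualFormIndex`, `add` and `search` are hypothetical names, and a plain map stands in for the real index. Each token is indexed in its original form and in its folded form, and the query term is looked up as typed.

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for an index that holds both exact and folded tokens.
final class DualFormIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Strips combining marks after decomposition (enough for French accents).
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Indexes every token under its original and its folded form.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            postings.computeIfAbsent(fold(token), k -> new TreeSet<>()).add(docId);
        }
    }

    // Looks the term up as typed; the query is deliberately not folded.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        DualFormIndex index = new DualFormIndex();
        index.add(1, "cote");
        index.add(2, "c\u00F4t\u00E9");
        System.out.println(index.search("cote"));           // prints: [1, 2]
        System.out.println(index.search("c\u00F4t\u00E9")); // prints: [2]
    }
}
```

A user who types the accents gets only the exact matches, while a user who types plain ASCII still finds everything. Folding everything on both sides would instead return both documents for either query.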

On Fri, Sep 23, 2022 at 5:30 AM Hrvoje Lončar <horvoje@gmail.com> wrote:
>
> Hi Stephane!
>
> Actually, I have exactly that kind of conversion, but I didn't mention it as my mail was long enough without it :)
> My main concern is whether I should let Lucene index the original keywords or not.
> Considering what you wrote, I guess your answer would be to store only converted values without exotic characters.
>
> Thanks a lot for your reply!
>
> BR,
> Hrvoje

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Best practice - preparing search term for Lucene

passignat at hotmail

Sep 23, 2022, 10:25 AM

Post #5 of 8 (570 views)

Hi

I wouldn't store the original value. That's "just" an index. But do store your DB identifiers, because I think you'll want them at some point. (I built the same kind of feature on top of DataNucleus.)

I usually have technical IDs in my DB, even more so since I started using JDO/JPA some 20 years ago.

With Lucene, I would also suggest storing a display-ready view of the entities. This lets you show results without querying the DB.
As you won't be able to index a whole big database in one go, think about how the indexer restarts. Having a numeric ID and a last-update field helped me.
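
One possible reading of the restart advice, sketched with plain Java collections: process rows in batches ordered by last-update value and numeric ID, and remember the last processed pair as a checkpoint so that a restarted indexer can continue from there. All names here (`ResumableIndexer`, `Row`, `Checkpoint`, `nextBatch`, `advance`) are hypothetical, and the in-memory list stands in for a database query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of checkpoint-based, restartable batch indexing.
final class ResumableIndexer {
    static final class Row {
        final long id;
        final long lastUpdate; // e.g. epoch millis of the last modification
        final String title;

        Row(long id, long lastUpdate, String title) {
            this.id = id;
            this.lastUpdate = lastUpdate;
            this.title = title;
        }
    }

    static final class Checkpoint {
        final long lastUpdate;
        final long id;

        Checkpoint(long lastUpdate, long id) {
            this.lastUpdate = lastUpdate;
            this.id = id;
        }
    }

    // Returns up to batchSize rows that sort strictly after the checkpoint,
    // ordered by (lastUpdate, id).
    static List<Row> nextBatch(List<Row> table, Checkpoint cp, int batchSize) {
        List<Row> pending = new ArrayList<>();
        for (Row r : table) {
            boolean after = r.lastUpdate > cp.lastUpdate
                    || (r.lastUpdate == cp.lastUpdate && r.id > cp.id);
            if (after) {
                pending.add(r);
            }
        }
        pending.sort(Comparator.comparingLong((Row r) -> r.lastUpdate)
                .thenComparingLong(r -> r.id));
        return new ArrayList<>(pending.subList(0, Math.min(batchSize, pending.size())));
    }

    // Moves the checkpoint to the last row of a processed batch.
    static Checkpoint advance(Checkpoint cp, List<Row> batch) {
        if (batch.isEmpty()) {
            return cp;
        }
        Row last = batch.get(batch.size() - 1);
        return new Checkpoint(last.lastUpdate, last.id);
    }

    public static void main(String[] args) {
        List<Row> table = List.of(new Row(1, 100, "a"), new Row(2, 100, "b"),
                new Row(3, 200, "c"), new Row(4, 50, "d"));
        Checkpoint cp = new Checkpoint(Long.MIN_VALUE, Long.MIN_VALUE);
        List<Row> batch;
        while (!(batch = nextBatch(table, cp, 2)).isEmpty()) {
            for (Row r : batch) {
                System.out.println("indexing id=" + r.id);
            }
            cp = advance(cp, batch); // a real system would persist this value
        }
    }
}
```

Using the ID as a tie-breaker matters when many rows share the same last-update value; without it, a batch boundary inside such a group would skip or repeat rows.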

Have you thought about numbers?

Re: Best practice - preparing search term for Lucene

horvoje at gmail

Sep 23, 2022, 10:29 AM

Post #6 of 8 (570 views)

Good point!
For now I'll leave it normalized. Every search term coming from the frontend is
stored and its counter updated, which will help me see trends after some time
and decide whether to change the logic or not.
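
A small sketch of such a term counter, assuming the raw (unfolded) term is what gets recorded, so that it stays visible how often users actually type the Croatian letters. `SearchTermStats` and its methods are hypothetical names, and an in-memory map stands in for whatever storage is really used.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory counter for search terms as typed by users.
final class SearchTermStats {
    private final Map<String, Integer> counts = new HashMap<>();

    private static String key(String rawTerm) {
        return rawTerm.trim().toLowerCase(Locale.ROOT);
    }

    // Increments the counter for the trimmed, lower-cased raw term.
    void record(String rawTerm) {
        String k = key(rawTerm);
        if (!k.isEmpty()) {
            counts.merge(k, 1, Integer::sum);
        }
    }

    int count(String rawTerm) {
        return counts.getOrDefault(key(rawTerm), 0);
    }

    // Total number of recorded queries containing at least one non-ASCII character.
    long queriesWithNonAscii() {
        long total = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (!e.getKey().chars().allMatch(c -> c < 128)) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        SearchTermStats stats = new SearchTermStats();
        stats.record("juha");
        stats.record("Juha ");
        stats.record("\u010De\u0161njak");
        System.out.println(stats.count("juha"));         // prints: 2
        System.out.println(stats.queriesWithNonAscii()); // prints: 1
    }
}
```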

P.S. Here is the funny part: in Croatian "pišanje" means peeing while
"pisanje" means writing (a book or a note, for example). Also "šišanje"
means hair cutting while "sisanje" means sucking.

On Fri, 23 Sept 2022, 18:02 Michael Sokolov, <msokolov@gmail.com> wrote:

> I think it depends how precise you want to make the search. If you
> want to enable diacritic-sensitive search in order to avoid confusions
> when users actually are able to enter the diacritics, you can index
> both ways (ascii-folded and not folded) and not normalize the query
> terms. Or you can just fold everything and not worry about it. In
> French I know there are confusable words like "cote" which has at
> least a few different meanings depending on the accents. Not sure how
> it is in Croatian.
>

Re: Best practice - preparing search term for Lucene

horvoje at gmail

Sep 24, 2022, 11:46 AM

Post #7 of 8 (570 views)

Well, my mistake was using the wrong word. I'm not storing but just giving
keywords to the analyzer. That was my mistake in writing. So far I don't index
exotic letters, just the normalized ones.
Additionally, I put into the index something like "Prod_3443", which is a product
ID for situations when a specific product needs to be checked, and a few other
things like "Tag_329", which gives me a fast search by a specific tag across
the products.

Re: Best practice - preparing search term for Lucene

horvoje at gmail

Sep 24, 2022, 11:49 AM

Post #8 of 8 (570 views)

Oh yes, I also use Spring Cache, which works fine, so I don't have to store
products in Lucene, which keeps the index smaller and faster.
