Mailing List Archive

Question for SynonymQuery
Hi Lucene users,

I recently came across SynonymQuery and found out that it only supports
single-term synonyms (since it accepts a list of Term which will be
considered as synonyms). We have some multi-term synonyms like "internet
device" <-> "wifi router" or "dns" <-> "domain name service". Am I right
that I need to use something like a BooleanQuery for these cases?

I have 2 other follow-up questions:
- Does SynonymQuery have any advantage over BooleanQuery? Or is it only
different in how scores are computed? As I understand SynonymWeight will
consider all terms as exactly the same while BooleanQuery will favor the
documents with more matched terms.
- Is it worth it to support multi-term synonyms in SynonymQuery? My feeling
is that it's better to just use BooleanQuery in those cases, since to
support multi-term synonyms it needs to accept a list of Query, which would
make it behave like a BooleanQuery. Also how scoring works with multi-term
is another problem.

Thanks & Regards!
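
The scoring difference asked about above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Lucene's scoring code: the saturation function tf / (tf + K), the constant K, and the extra synonym "nameserver" are all hypothetical. It ignores document frequency and length normalization. It shows only the general idea that a synonym-style query pools term frequencies and scores once, while a BooleanQuery of SHOULD clauses adds up one score per matching clause.

```java
import java.util.List;
import java.util.Map;

/** Toy model only: not Lucene's scoring code. */
final class SynonymScoringSketch {
    // Hypothetical saturation constant, loosely in the spirit of BM25's k1.
    static final double K = 1.2;

    /** Saturating term-frequency score: grows with tf but flattens out. */
    static double saturate(double tf) {
        return tf / (tf + K);
    }

    /** Synonym-style: pool the frequencies of all synonyms, then score once. */
    static double synonymStyle(Map<String, Integer> docTf, List<String> synonyms) {
        int pooled = 0;
        for (String term : synonyms) {
            pooled += docTf.getOrDefault(term, 0);
        }
        return saturate(pooled);
    }

    /** Boolean SHOULD-style: score each synonym separately and add the scores. */
    static double booleanStyle(Map<String, Integer> docTf, List<String> synonyms) {
        double sum = 0;
        for (String term : synonyms) {
            sum += saturate(docTf.getOrDefault(term, 0));
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> synonyms = List.of("dns", "nameserver");
        Map<String, Integer> docA = Map.of("dns", 2);                  // one synonym, twice
        Map<String, Integer> docB = Map.of("dns", 1, "nameserver", 1); // two synonyms, once each
        System.out.println("synonym-style: " + synonymStyle(docA, synonyms)
                + " vs " + synonymStyle(docB, synonyms));
        System.out.println("boolean-style: " + booleanStyle(docA, synonyms)
                + " vs " + booleanStyle(docB, synonyms));
    }
}
```

In this toy model both documents score the same under the synonym-style function, while the boolean-style function ranks the document that contains both synonyms higher.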
Re: Question for SynonymQuery [ In reply to ]
Hello.

1) Yes. That's the purpose.
2) I've skimmed through QueryBuilder.java. My conclusion is that it creates
a BooleanQuery of SHOULD clauses (although arguably it should be something
like DisjunctionMaxQuery) over PhraseQuery or MultiPhraseQuery instances.
Good hack!



--
Sincerely yours
Mikhail Khludnev
Re: Question for SynonymQuery [ In reply to ]
Hi Anh

The following Stack Overflow link might help:

https://stackoverflow.com/questions/73240494/can-someone-assist-me-with-a-multi-word-synonym-problem-in-lucene

The following thread seems to confirm that escaping the space with a
backslash does not help:

https://lists.apache.org/list?java-user@lucene.apache.org:2022-3

HTH

Michael




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Question for SynonymQuery [ In reply to ]
Thanks everyone for the insight. I guess I'll use BooleanQuery then.

There is also a caveat I noticed (not sure if it's an issue or not), which
is slightly different from the thread mentioned above. Say I have a
multi-word synonym pair, "wifi router" and "internet device". Using
SynonymGraphFilter at query time (I had already escaped the space with a
backslash when building the SynonymMap) produces this TokenStream for the
query "wifi router":

"wifi" (PositionIncrement=1,PositionLength=1), "internet"
(PositionIncrement=0,PositionLength=1), "router"
(PositionIncrement=1,PositionLength=1), "device"
(PositionIncrement=0,PositionLength=1)

This has the same effect as if I had two single-term synonym pairs,
"wifi"/"internet" and "router"/"device". Converted to a BooleanQuery, it
becomes ("wifi" OR "internet") AND ("router" OR "device"), but what I would
like to achieve is ("wifi" AND "router") OR ("internet" AND "device").

Is there a workaround for this case?
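
To see why the position attributes matter, here is a toy graph walker. It is a sketch only, not Lucene's implementation: the class and method names are hypothetical, and the attribute values used for the "proper graph" case in main are illustrative assumptions rather than output captured from SynonymGraphFilter. Each token is treated as an edge from node `pos` to node `pos + PositionLength`, and the walker lists every path through the graph.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy token-graph walker; illustrative only, not Lucene's implementation. */
final class TokenGraphSketch {
    /** A token carrying the two position attributes quoted in the message. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Lists every sequence of terms that leads from the first node to the last. */
    static List<String> paths(List<Tok> stream) {
        int[] from = new int[stream.size()];
        int[] to = new int[stream.size()];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < stream.size(); i++) {
            Tok t = stream.get(i);
            pos += t.posInc;          // node the token starts at
            from[i] = pos;
            to[i] = pos + t.posLen;   // node the token arrives at
            end = Math.max(end, to[i]);
        }
        List<String> out = new ArrayList<>();
        walk(stream, from, to, 0, end, "", out);
        return out;
    }

    private static void walk(List<Tok> stream, int[] from, int[] to, int node, int end,
                             String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix.trim());
            return;
        }
        for (int i = 0; i < stream.size(); i++) {
            if (from[i] == node) {
                walk(stream, from, to, to[i], end, prefix + " " + stream.get(i).term, out);
            }
        }
    }

    public static void main(String[] args) {
        // The stream quoted above: every PositionLength is 1.
        List<Tok> flat = List.of(new Tok("wifi", 1, 1), new Tok("internet", 0, 1),
                new Tok("router", 1, 1), new Tok("device", 0, 1));
        // Hypothetical graph-shaped stream where the two phrases stay on separate paths.
        List<Tok> graph = List.of(new Tok("internet", 1, 1), new Tok("wifi", 0, 2),
                new Tok("device", 1, 2), new Tok("router", 1, 1));
        System.out.println(paths(flat));
        System.out.println(paths(graph));
    }
}
```

With every PositionLength equal to 1, the walker finds all four word combinations, which corresponds to ("wifi" OR "internet") AND ("router" OR "device"). In the graph-shaped stream only the two intended phrases remain.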

Thanks,
Anh Dung Bui


Re: Question for SynonymQuery [ In reply to ]
Hello Anh,
I was intrigued by your question and managed to get it working, more or less.
See
https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/java/org/apache/lucene/playground/TestMultiPulty.java
Note that synonym files such as
https://github.com/mkhludnev/likely/blob/eval-mulyw-syns/src/test/resources/org/apache/lucene/playground/multy-syn.txt
should use
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymMap.html#WORD_SEPARATOR
Have a nice hack!


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
RE: Question for SynonymQuery [ In reply to ]
Hi Anh

The two links Michael shared relate to questions I asked when I was trying to get synonym matching with our application.

I really do have multi-term synonym matching working at this point; there's always scope for improvement, of course, but with the hints supplied in those threads I was able to index our documents and search them using a variety of synonymous terms, both single words and phrases.

Our application does not use either BooleanQuery or SynonymQuery; I have just used the standard QueryParser. Instead, the synonym processing occurs in the indexing phase. This is not only simpler (one search pattern, one query), but I think you would also find it gives you better performance, because the synonym processing occurs once at indexing time and not at all during searching, and I'm sure you'll be doing far more searching than indexing.

cheers
T


Re: Question for SynonymQuery [ In reply to ]
Hello Trevor.
Can you help me better understand this approach? If we have the text "wifi
router" and inject "internet device" at indexing time, the terms reside at the
same positions. How do you avoid a false-positive match for the query "wifi device"?
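
The concern can be made concrete with a toy positional index. This is only a sketch, not Lucene's postings format or PhraseQuery: the class and its methods are hypothetical, and it assumes the injected synonym terms are stacked on exactly the same positions as the original terms.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy positional index; illustrative only, not Lucene's postings format. */
final class StackedSynonymSketch {
    // term -> set of positions at which the term was indexed
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(String term, int position) {
        postings.computeIfAbsent(term, k -> new HashSet<>()).add(position);
    }

    /** True if the terms occur at consecutive positions, like an exact phrase match. */
    boolean matchesPhrase(String... terms) {
        for (int start : postings.getOrDefault(terms[0], Set.of())) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = postings.getOrDefault(terms[i], Set.of()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** Indexes "wifi router" with "internet device" stacked on the same positions. */
    static StackedSynonymSketch example() {
        StackedSynonymSketch index = new StackedSynonymSketch();
        index.add("wifi", 0);
        index.add("internet", 0);
        index.add("router", 1);
        index.add("device", 1);
        return index;
    }

    public static void main(String[] args) {
        StackedSynonymSketch index = example();
        System.out.println(index.matchesPhrase("wifi", "router"));
        System.out.println(index.matchesPhrase("internet", "device"));
        System.out.println(index.matchesPhrase("wifi", "device"));
    }
}
```

Because position 0 holds both "wifi" and "internet" and position 1 holds both "router" and "device", the mixed phrase "wifi device" matches in this model just as the two intended phrases do.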


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
Re: Question for SynonymQuery [ In reply to ]
Independent of the synonym implementation, you might want to consider vector/similarity search. For example, if the query is "internet device",
then the cosine similarities of the multi-term phrases "internet device", "wifi router" and "wifi device" using the model "all-mpnet-base-v2" are:

{"cosineSimilarity":1,"cosineDistance":0,"sentenceOne":"internet device","sentenceTwo":"internet device"}

{"cosineSimilarity":0.47380197,"cosineDistance":0.526198,"sentenceOne":"internet device","sentenceTwo":"wifi router"}

{"cosineSimilarity":0.74852204,"cosineDistance":0.25147796,"sentenceOne":"internet device","sentenceTwo":"wifi device"}

As you can see, with this model "wifi device" is closer to "internet
device" than "wifi router" is. If you consider "wifi device" a false
positive, this is of course not helpful, but otherwise it might be useful
for the original question of this thread.

HTH

Michael
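
For reference, the two numbers in each result line are related as distance = 1 - similarity. A minimal sketch of the computation follows. The class name is hypothetical and the small vectors in main are made up for illustration; real sentence embeddings from a model such as "all-mpnet-base-v2" have many more dimensions and would come from the model, not from this code.

```java
/** Minimal cosine similarity sketch; the vectors used in main are made up. */
final class CosineSketch {
    /** Cosine of the angle between two vectors of equal length. */
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            throw new IllegalArgumentException("zero vector has no direction");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Cosine distance as used in the results above: 1 minus the similarity. */
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    public static void main(String[] args) {
        double[] query = {3, 4};      // hypothetical embedding of the query
        double[] candidate = {4, 3};  // hypothetical embedding of a candidate phrase
        System.out.println(cosineSimilarity(query, candidate));
        System.out.println(cosineDistance(query, candidate));
    }
}
```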



RE: Question for SynonymQuery [ In reply to ]
Hi Mikhail

Yes, if my text contains "wifi router", and my synonym map includes "wifi router","internet device", then if I search for "wifi device" I will get a match. While I can see that on the strictest criteria this might be incorrect, in practice I would happily see that returned as a match. I wouldn't call it a false positive, it's more like an unintended benefit.

No doubt there are pathological cases where I would not be so happy but nobody has come up with one in our application yet. As I said there's scope for improvement in our implementation, but at this point I'm not convinced that the benefit of plugging this gap justifies the cost.

If somebody points you to a better option I would also be interested in seeing it.

cheers
T

Re: Question for SynonymQuery [ In reply to ]
Thanks Mikhail!

It turns out I was using FlattenGraphFilter, which caused every
PositionLength to be 1 and resulted in the behavior above =)

A side note: we don't need to use WORD_SEPARATOR in the synonym
file. SynonymMap.Parser.analyze tokenizes the input and appends the
separator for us.
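
As a rough picture of what "tokenize and append the separator" amounts to, here is a toy stand-in. It is a sketch under assumptions, not the Lucene method: it splits on whitespace instead of running an Analyzer, the class and method names are hypothetical, and the constant is assumed to mirror SynonymMap.WORD_SEPARATOR (the NUL character).

```java
/** Toy stand-in for joining analyzed words; not SynonymMap.Parser.analyze itself. */
final class WordSeparatorSketch {
    // Assumption: mirrors SynonymMap.WORD_SEPARATOR, the NUL character.
    static final char WORD_SEPARATOR = '\u0000';

    /** Splits a phrase on whitespace and joins the words with the separator. */
    static String joinWords(String phrase) {
        String[] words = phrase.trim().split("\\s+");
        return String.join(String.valueOf(WORD_SEPARATOR), words);
    }

    public static void main(String[] args) {
        String joined = joinWords("domain name service");
        // Print with the separator made visible.
        System.out.println(joined.replace(WORD_SEPARATOR, '|'));
    }
}
```

The point is that a synonym file can contain plain spaces between words, since the joined form is produced when the file is parsed.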

Regards,
Anh Dung Bui

Re: Question for SynonymQuery [ In reply to ]
Hey Mikhail and Anh Dung Bui,
I am also struggling with synonym queries.
My use case, for example: I created synonyms for these words:
API ------> Application program interface
UI ---------> user interface

doc1 ---> This is API and it is called Application program interface
doc2 ---> How i help you in UI things
doc3 ---> my substance interface
doc4 ---> how to write c++ program

What I want to achieve is this: when I search for API and UI together,

the expected result is that it highlights
---> API and Application program interface in doc1
---> UI in doc2

but the actual output is that it highlights
---> API and Application program interface in doc1
---> UI in doc2
---> interface in doc3
---> program in doc4

Do you have any suggestions on how I can achieve this?

(API) OR (UI)
Each term should act as a phrase query for API UI:
no single tokens should be matched; only the whole phrase should match.
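
The desired behavior can be sketched as an OR over whole phrases, where each multi-word synonym only matches as a complete phrase. This is a toy model and not Lucene's PhraseQuery or highlighter: the class, its methods, and the simple lowercase-and-split tokenizer are hypothetical stand-ins for a real analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy phrase matcher; illustrative only, not Lucene's PhraseQuery. */
final class PhraseOnlySketch {
    /** Lowercases and splits on anything that is not a letter, digit, or '+'. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9+]+"));
    }

    /** True if the document contains all tokens of the phrase, adjacent and in order. */
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    /** OR over whole phrases: a single word of a multi-word synonym never matches alone. */
    static boolean matchesAny(String text, List<String> phrases) {
        List<String> doc = tokenize(text);
        for (String phrase : phrases) {
            if (containsPhrase(doc, tokenize(phrase))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> phrases = List.of(
                "API", "Application program interface", "UI", "user interface");
        String[] docs = {
            "This is API and it is called Application program interface",
            "How i help you in UI things",
            "my substance interface",
            "how to write c++ program"
        };
        for (String doc : docs) {
            System.out.println(matchesAny(doc, phrases) + " <- " + doc);
        }
    }
}
```

In this model doc1 and doc2 match, while doc3 and doc4 do not, because "interface" and "program" on their own are not complete phrases.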





Re: Question for SynonymQuery [ In reply to ]
Hello, Satnam.
It seems I achieved what you are asking for:
https://github.com/mkhludnev/likely/blob/381b491d25e4d2035dd5b8a891dfdcfe2b986b90/src/test/java/org/apache/lucene/playground/TestMultiPulty.java#L32
It expands API and UI into phrases, which match as you expect.
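
To illustrate the idea without depending on the linked test, here is a minimal plain-Java sketch of phrase-level expansion. It is only a model and does not call Lucene: the class name, the synonym table, and the whitespace tokenizer are all assumptions made for the example. In Lucene terms it roughly corresponds to SHOULD clauses over phrase queries, as discussed earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java model of phrase-level synonym expansion; it does not use Lucene. */
final class PhraseSynonymSketch {

    /** Hypothetical synonym table, taken from the example in this thread. */
    static final Map<String, String> SYNONYMS = new LinkedHashMap<>();
    static {
        SYNONYMS.put("api", "application program interface");
        SYNONYMS.put("ui", "user interface");
    }

    /** Lowercases and splits on whitespace; a stand-in for a real analyzer. */
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Expands each query term into itself plus its synonym kept as one whole phrase. */
    static List<List<String>> expand(String query) {
        List<List<String>> phrases = new ArrayList<>();
        for (String term : tokenize(query)) {
            phrases.add(Collections.singletonList(term));
            String synonym = SYNONYMS.get(term);
            if (synonym != null) {
                phrases.add(tokenize(synonym));
            }
        }
        return phrases;
    }

    /** OR over all phrases: a document matches if any whole phrase occurs in it. */
    static boolean matches(String query, String document) {
        List<String> doc = tokenize(document);
        for (List<String> phrase : expand(query)) {
            // indexOfSubList finds the phrase tokens at consecutive positions only.
            if (Collections.indexOfSubList(doc, phrase) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] docs = {
            "This is API and it is called Application program interface",
            "How i help you in UI things",
            "my substance interface",
            "how to write c++ program"
        };
        for (String doc : docs) {
            System.out.println(matches("API UI", doc) + " <- " + doc);
        }
    }
}
```

Because "application program interface" and "user interface" are kept as whole phrases, a lone "interface" (doc3) or "program" (doc4) no longer matches, while doc1 and doc2 still do.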


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
Re: Question for SynonymQuery [ In reply to ]
Right. SynonymMap.WORD_SEPARATOR
<https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymMap.html#WORD_SEPARATOR>
was a redundant complication. Spaces work fine.
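
A rough plain-Java sketch of why spaces are enough: as noted earlier in the thread, SynonymMap.Parser.analyze tokenizes each entry and appends the separator itself. The code below only models that step with a whitespace split. The class and method names are hypothetical, and a real analyzer may tokenize differently.

```java
/** Plain-Java model of joining analyzed tokens with the word separator; not Lucene code. */
final class WordSeparatorSketch {

    /** Same value as Lucene's SynonymMap.WORD_SEPARATOR (the NUL character). */
    static final char WORD_SEPARATOR = 0;

    /** Splits a synonym entry on whitespace and joins the tokens with the separator. */
    static String joinTokens(String entry) {
        String trimmed = entry.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return String.join(String.valueOf(WORD_SEPARATOR), trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // An entry written with a plain space ...
        String fromSpaces = joinTokens("wifi router");
        // ... ends up identical to one assembled by hand with the separator.
        String byHand = "wifi" + WORD_SEPARATOR + "router";
        System.out.println(fromSpaces.equals(byHand));
    }
}
```

Under this model, writing the separator by hand in the synonym file adds nothing, since the parser produces the same internal form from a space-separated entry.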


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!