Mailing List Archive

Re: multi-term synonym prevents single-term match -- known issue?
It's time to summon Lucene devs
https://issues.apache.org/jira/browse/SOLR-16652?focusedCommentId=17687998&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17687998

It seems to be by design:
https://github.com/apache/lucene/blob/main/lucene/queryparser/src/test/org/apache/lucene/queryparser/classic/TestQueryParser.java#L591
It sets a multi-word synonym: "guinea pig => cavy"
dumb.parse("guinea pig") => ((+field:guinea +field:pig) field:cavy)
So it doesn't match just 'guinea', contrary to what this ticket expects.
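
For anyone following along, the effect of that clause structure can be checked with a toy model. This is only a sketch in Python, not Lucene code: the Term and Bool classes and the matches() and hits() helpers are made up for illustration, and they cover only MUST and SHOULD clauses (no MUST_NOT, FILTER, or minimum-should-match). The four documents are the ones from Rudi's original report quoted below. The outer required wrapper from the explain output is left out, since it doesn't change which docs match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A single-term query (hypothetical stand-in for a term query)."""
    text: str


@dataclass(frozen=True)
class Bool:
    """A boolean query with required (MUST) and optional (SHOULD) clauses."""
    must: tuple = ()
    should: tuple = ()


def matches(query, tokens):
    """Return True if the query matches a document given as a set of tokens."""
    if isinstance(query, Term):
        return query.text in tokens
    if query.must:
        # With MUST clauses present, every one of them has to match;
        # SHOULD clauses then only affect scoring, not matching.
        return all(matches(q, tokens) for q in query.must)
    # With only SHOULD clauses, at least one of them has to match.
    return any(matches(q, tokens) for q in query.should)


# The four test documents from the original report.
DOCS = {"1": "foo", "2": "bar", "3": "foo bar", "4": "bar foo"}

# Current expansion:  (zzz (+foo +bar))
CURRENT = Bool(should=(Term("zzz"), Bool(must=(Term("foo"), Term("bar")))))

# Expansion proposed in the report:  (zzz (foo bar))
PROPOSED = Bool(should=(Term("zzz"), Bool(should=(Term("foo"), Term("bar")))))


def hits(query):
    """Return the sorted ids of the documents that the query matches."""
    return sorted(
        doc_id for doc_id, text in DOCS.items()
        if matches(query, set(text.split()))
    )


print("current: ", hits(CURRENT))
print("proposed:", hits(PROPOSED))
```

Under this model the current expansion returns only docs 3 and 4, as in the report, while the proposed expansion returns all four.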

On Mon, Feb 13, 2023 at 5:33 PM Rudi Seitz <rudi@rudiseitz.com> wrote:

> Thanks Mikhail.
> I think your directional approach ("foo bar=>baz,foo,bar") would work, but
> we'd also need "baz=>baz,foo bar" for a complete workaround.
> I've added your message as a comment on the ticket.
> Rudi
>
> On Sat, Feb 11, 2023 at 12:34 PM Mikhail Khludnev <mkhl@apache.org> wrote:
>
> > Thanks for raising a ticket. Here are just two considerations:
> > > we could change the synonym rule to "foo bar,baz,foo,bar" but this would
> > > mean that a query for "foo" could now match a document containing only
> > > "bar", which is not the intent of the original rule.
> > Ok. The latter issue can probably be fixed by making the synonym directional:
> > foo bar=>baz,foo,bar
> > Right, it seems like a weird band-aid.
> >
> > I stepped through the Lucene code. The MUST occur for synonyms is set here:
> > https://github.com/apache/lucene/blob/7baa01b3c2f93e6b172e986aac8ef577a87ebceb/lucene/core/src/java/org/apache/lucene/util/QueryBuilder.java#L534
> > Presumably, the original terms could use the defaultOperator, while the
> > synonym replacement keeps MUST.
> >
> >
> >
> >
> >
> > On Sat, Feb 11, 2023 at 12:17 AM Rudi Seitz <rudi@rudiseitz.com> wrote:
> >
> > > Thanks Mikhail and Michael.
> > > Based on your feedback, I created a ticket:
> > > https://issues.apache.org/jira/browse/SOLR-16652
> > > In the ticket, I mentioned why updating the synonym rule or setting
> > > sow=true causes other problems in this case, unfortunately. I haven't yet
> > > looked through code to see where the behavior could be changed.
> > > Rudi
> > >
> > >
> > > On Fri, Feb 10, 2023 at 11:26 AM Michael Gibney <michael@michaelgibney.net> wrote:
> > >
> > > > Rudi,
> > > >
> > > > I agree, this does not seem like how it should behave. Probably
> > > > something that could be fixed in edismax, not something lower-level
> > > > (Lucene)?
> > > >
> > > > Michael
> > > >
> > > > On Fri, Feb 10, 2023 at 9:38 AM Mikhail Khludnev <mkhl@apache.org> wrote:
> > > > >
> > > > > Hello, Rudi.
> > > > > Well, it doesn't seem perfect. It can probably be fixed via
> > > > > foo bar,zzz,foo,bar
> > > > > And in some sense this behavior is reasonable.
> > > > > Also, you can experiment with the sow and pf params (the latter is
> > > > > described on the dismax page only).
> > > > >
> > > > > > On Thu, Feb 9, 2023 at 8:19 PM Rudi Seitz <rudi@rudiseitz.com> wrote:
> > > > >
> > > > > > Is this known behavior or is it worth a JIRA ticket?
> > > > > >
> > > > > > > Searching against a text_general field in Solr 9.1, if my edismax query is
> > > > > > > "foo bar" I should be able to get matches for "foo" without "bar" and vice
> > > > > > > versa. However, if there happens to be a synonym rule applied at query
> > > > > > > time, like "foo bar,zzz" I can no longer get single-term matches against
> > > > > > > "foo" or "bar." Both terms are now required, but can occur in either order.
> > > > > > > If we change the text_general analysis chain to apply synonyms at index
> > > > > > > time instead of query time, this behavior goes away and single-term matches
> > > > > > > are again possible.
> > > > > >
> > > > > > > To reproduce, use the _default configset with "foo bar,zzz" added to
> > > > > > > synonyms.txt. Index these four docs:
> > > > > >
> > > > > > {"id":"1", "title_txt":"foo"}
> > > > > > {"id":"2", "title_txt":"bar"}
> > > > > > {"id":"3", "title_txt":"foo bar"}
> > > > > > {"id":"4", "title_txt":"bar foo"}
> > > > > >
> > > > > > Issue a query for "foo bar" (i.e.
> > > > > > defType=edismax&q.op=OR&qf=title_txt&q=foo bar)
> > > > > > Result: Only docs 3 and 4 come back
> > > > > >
> > > > > > Issue a query for "bar foo"
> > > > > > Result: All four docs come back; the synonym rule is not invoked
> > > > > >
> > > > > > Looking at the explain output for "foo bar" we see:
> > > > > >
> > > > > > +((title_txt:zzz (+title_txt:foo +title_txt:bar)))
> > > > > >
> > > > > >
> > > > > > Looking at the explain output for "bar foo" we see:
> > > > > >
> > > > > > +((title_txt:bar) (title_txt:foo))
> > > > > >
> > > > > > > So, the observed behavior makes sense according to the low-level query
> > > > > > > structure. But -- is this how it's "supposed" to work?
> > > > > >
> > > > > > Why not expand the "foo bar" query like this instead?
> > > > > >
> > > > > > +((title_txt:zzz (title_txt:foo title_txt:bar)))
> > > > > >
> > > > > > Rudi
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours
> > > > > Mikhail Khludnev
> > > > > https://t.me/MUST_SEARCH
> > > > > A caveat: Cyrillic!
> > > >
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
Re: multi-term synonym prevents single-term match -- known issue?
I opened a reproducer: https://github.com/apache/lucene/pull/12157
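
Separately from the Lucene-level reproducer, here is a rough Python sketch for replaying the Solr-side steps from the original report. The base URL and the collection name synonym_test are assumptions, as is adding debugQuery=true to see the parsed query; the documents and the edismax parameters are taken from the report. It assumes a collection created from the _default configset with "foo bar,zzz" already added to synonyms.txt. Nothing is sent unless you call post_docs() or run_query() yourself.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a local Solr with a collection named "synonym_test".
SOLR_BASE = "http://localhost:8983/solr/synonym_test"

# The four documents from the original report.
DOCS = [
    {"id": "1", "title_txt": "foo"},
    {"id": "2", "title_txt": "bar"},
    {"id": "3", "title_txt": "foo bar"},
    {"id": "4", "title_txt": "bar foo"},
]


def update_request(base=SOLR_BASE):
    """Build the URL and JSON body for indexing the four docs."""
    url = f"{base}/update?commit=true"
    body = json.dumps(DOCS).encode("utf-8")
    return url, body


def select_url(q, base=SOLR_BASE):
    """Build the edismax query URL used in the report.

    debugQuery=true is an addition here, to expose the parsed query.
    """
    params = {
        "defType": "edismax",
        "q.op": "OR",
        "qf": "title_txt",
        "q": q,
        "debugQuery": "true",
    }
    return f"{base}/select?{urlencode(params)}"


def post_docs(base=SOLR_BASE):
    """Send the documents to Solr (requires a running instance)."""
    url, body = update_request(base)
    req = Request(url, data=body, headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)


def run_query(q, base=SOLR_BASE):
    """Run the query and return the full JSON response."""
    with urlopen(select_url(q, base)) as resp:
        return json.load(resp)


print(select_url("foo bar"))
print(select_url("bar foo"))
```

Comparing the responses of run_query("foo bar") and run_query("bar foo") should show the difference described in the report, if the setup matches.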

On Mon, Feb 13, 2023 at 6:46 PM Mikhail Khludnev <mkhl@apache.org> wrote:

> It's time to summon Lucene devs
> https://issues.apache.org/jira/browse/SOLR-16652?focusedCommentId=17687998&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17687998
>
> It seems to be by design:
> https://github.com/apache/lucene/blob/main/lucene/queryparser/src/test/org/apache/lucene/queryparser/classic/TestQueryParser.java#L591
> It sets a multi-word synonym: "guinea pig => cavy"
> dumb.parse("guinea pig") => ((+field:guinea +field:pig) field:cavy)
> So it doesn't match just 'guinea', contrary to what this ticket expects.


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!