Mailing List Archive

Correct usage of synonyms with Japanese
Hello,

I'm working on a project that involves search in Japanese and uses
synonyms. The Japanese tokenizer creates an analysis graph, but the
SynonymGraphFilter states it cannot take a graph as input. After a few
tests I've seen it can create some unusual outputs if given a graph as
input. The SynonymFilter is marked deprecated, and has documentation
pointing out it doesn't handle multiple synonym paths correctly.

My question is what is the 'correct' way to handle synonyms with Japanese
in Lucene? Should the graph be flattened before the SynonymGraphFilter,
then flattened again after? This seems extra lossy. Is the correct answer
to make SynonymGraphFilter accept graphs as inputs? Is there another option
that I'm missing?

thanks,
Geoff
Re: Correct usage of synonyms with Japanese
Hi Geoffrey,

[Disclaimer: Geoffrey and I both work at Amazon on customer-facing product
search]

We absolutely must get SynonymGraphFilter consuming input graphs! This is
just a (serious) bug in it! But it's just software, let's fix it :) That
is clearly the right fix, it is just rather fun and challenging. But it is
doable. Could you open an issue? I thought we had one for this but cannot
find it now.

I think you are using the Kuromoji Japanese tokenizer? It produces
nice-looking graphs right from the get-go (tokenizer), with compound words
also properly decompounded so both options are indexed/searched.

History: we created SynonymGraphFilter, along with other important
QueryParser (e.g. http://issues.apache.org/jira/browse/LUCENE-7603) and
Query improvements, to get multi-term synonyms working correctly, finally
in Lucene. With the old SynonymFilter, positional queries involving
multi-term synonyms would have both false positive and false negative hits
... I tried to explain the messy situation here:
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

And, finally, with SynonymGraphFilter, used only at search time, and with
tokens consumed by a QueryParser that knows how to turn graphs into correct
positional queries, those bugs are finally fixed -- multi-term synonyms
work correctly.

When used during indexing, SynonymGraphFilter must eventually be followed
by FlattenGraphFilter, because Lucene's index does not store the posLength
attribute of each token. I.e., the graph is lost anyways during indexing,
so FlattenGraphFilter tries to flatten the graph in the most
information-preserving way (but still loses information, resulting in false
positive/negative hits for positional queries).
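
To make the posLength point concrete, here is a minimal sketch in plain Java.
It is a toy model, not Lucene code: the Tok class, the example tokens, and the
dns -> "domain name service" synonym are all assumptions made up for
illustration. It lists the phrases (paths) a token graph encodes, first with
posLength kept, then with every token treated as one position long, which is
all an index without posLength can represent.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are NOT Lucene's API.
class TokenGraphDemo {
    // A token with a position increment and a position length.
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;
        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    // "dns is down" with the assumed synonym dns -> "domain name service".
    static List<Tok> example() {
        return List.of(
            new Tok("domain", 1, 1),
            new Tok("dns", 0, 3),      // spans the same 3 positions as the synonym
            new Tok("name", 1, 1),
            new Tok("service", 1, 1),
            new Tok("is", 1, 1),
            new Tok("down", 1, 1));
    }

    // Lists every start-to-end path through the graph. With keepPosLen=false,
    // each token is treated as exactly one position long.
    static List<String> paths(List<Tok> toks, boolean keepPosLen) {
        int n = toks.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < n; i++) {
            Tok t = toks.get(i);
            pos += t.posInc;
            from[i] = pos;
            to[i] = pos + (keepPosLen ? t.posLen : 1);
            end = Math.max(end, to[i]);
        }
        List<String> out = new ArrayList<>();
        walk(toks, from, to, 0, end, "", out);
        return out;
    }

    // Depth-first walk: follow every token that starts at the current node.
    private static void walk(List<Tok> toks, int[] from, int[] to,
                             int node, int end, String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix);
            return;
        }
        for (int i = 0; i < toks.size(); i++) {
            if (from[i] == node) {
                String term = toks.get(i).term;
                String next = prefix.isEmpty() ? term : prefix + " " + term;
                walk(toks, from, to, to[i], end, next, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(paths(example(), true));
        System.out.println(paths(example(), false));
    }
}
```

With posLength kept, the paths are "domain name service is down" and "dns is
down". Without it, "dns is down" disappears (a false negative for that phrase)
and "dns name service is down" appears (a false positive). The sketch does not
handle holes and does not reproduce FlattenGraphFilter's actual algorithm.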

Anyways, until we fix this, feeding a graph to SynonymGraphFilter will
indeed mess up its output in weird ways.

This problem has come up several times recently, e.g.
https://issues.apache.org/jira/browse/LUCENE-9173 and
https://issues.apache.org/jira/browse/LUCENE-9123. There is also the more
revolutionary https://issues.apache.org/jira/browse/LUCENE-5012 but that is
too ambitious for this "small bug", I think.

SynonymGraphFilter also struggles with holes, since they might break the
token graph into two: https://issues.apache.org/jira/browse/LUCENE-8985

For short term workarounds, some possible ideas:

* I think Kuromoji has an option to NOT produce the compounds/graph
output? It has an indexing and searching mode. That might be one
workaround, if you could then move the compounding into
SynonymGraphFilter? I'm not sure that is possible, in general, since
Kuromoji is using more powerful information (dictionary) to make its graph
choices.

* Use FlattenGraphFilter immediately before SynonymGraphFilter, and then
again at the end of your analysis chain. This loses information, since all
tokens are "squashed" onto one another, and we could no longer tell which
sequence of tokens corresponded to which compound word, and it might mean
some synonyms fail to apply when they should have.

* Go back to SynonymFilter at indexing time. It will also not fully
handle an input graph correctly, and will necessarily miss some synonyms
that should've applied, but it may produce a more "reasonable" bad output,
and then you shouldn't need FlattenGraphFilter at all. But test this
carefully to understand what it is doing!
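
In graph terms, the first workaround amounts to dropping the compound token so
that what reaches the synonym filter is a simple chain. The sketch below is a
toy model in plain Java, not Lucene code: the Tok class and the token values
are assumptions, modeled on how a compound such as 関西国際空港 might come out
as the compound plus its parts 関西 / 国際 / 空港 (romanized in the code).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are NOT Lucene's API.
class CompoundDropDemo {
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;
        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    // Assumed graph output for one compound: the compound token (posLen 3)
    // stacked on the first of its three parts.
    static List<Tok> example() {
        return List.of(
            new Tok("kansai", 1, 1),
            new Tok("kansaikokusaikuukou", 0, 3),
            new Tok("kokusai", 1, 1),
            new Tok("kuukou", 1, 1));
    }

    // This toy's definition of "not a graph": every token advances by one
    // position and is one position long.
    static boolean isLinear(List<Tok> toks) {
        for (Tok t : toks) {
            if (t.posInc != 1 || t.posLen != 1) {
                return false;
            }
        }
        return true;
    }

    // Removes tokens spanning more than one position. A dropped token's
    // position increment is carried over to the next kept token so that
    // positions stay aligned.
    static List<Tok> dropCompounds(List<Tok> toks) {
        List<Tok> out = new ArrayList<>();
        int carriedInc = 0;
        for (Tok t : toks) {
            if (t.posLen > 1) {
                carriedInc += t.posInc;
                continue;
            }
            out.add(new Tok(t.term, t.posInc + carriedInc, 1));
            carriedInc = 0;
        }
        return out;
    }

    static List<String> terms(List<Tok> toks) {
        List<String> out = new ArrayList<>();
        for (Tok t : toks) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(isLinear(example()));
        List<Tok> chain = dropCompounds(example());
        System.out.println(isLinear(chain) + " " + terms(chain));
    }
}
```

The trade-off is the one described in the first bullet: once the compound is
gone, any synonym rule keyed on the compound form would have to be written in
terms of its parts instead.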

But let's fix the issue for real!

Mike McCandless

http://blog.mikemccandless.com


On Tue, May 18, 2021 at 6:17 AM Geoffrey Lawson <geoffrey.lawson0@gmail.com>
wrote:

Re: Correct usage of synonyms with Japanese
Thanks for the background, Mike!

I am using the Kuromoji tokenizer. Using discardCompoundToken is a good
point. I had not considered that.

To fix the issue, I've created a Jira ticket here:
https://issues.apache.org/jira/browse/LUCENE-9966.

geoff

On Tue, May 18, 2021 at 11:07 PM Michael McCandless <
lucene@mikemccandless.com> wrote:
