synonym question

Mar 14, 2022, 7:54 AM

Post #1 of 4 (428 views)

I have technical data which I am querying with Lucene; one of the features
of the content is that a large number of technical terms may be written
either as multiple words or as a compound word. For example, ISOWEEK or
ISO WEEK, and SynonymFilter or synonym filter.

I have a synonym table which includes all of these phrases, thus isoweek=iso
week, isoyear=iso year, etc.

My understanding is that including the synonyms (with a SynonymFilter in my
analyzer) at index time means that I shouldn't have to include the synonym
filter in the query analyzer because if any of the synonyms appear in a
query they will match records containing any of the synonymous terms, as all
values are indexed for any one of them.

Checking with Luke, this appears to be the case; however, the queries are not
matching all the records I expect them to, so I am taking a deeper look.

In the indexing phase, input text is tokenised on whitespace and
punctuation, lowercased, and then processed by a synonym filter. The
relevant part of the analyzer is this:

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split the input on whitespace.
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        // Lowercase first, then apply the technical-term filter.
        TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
        // Expand synonyms; the third argument is ignoreCase.
        result = new SynonymGraphFilter(result,
                getSynonyms(options.getSynonymsList()), true);
        // Flatten the token graph so that it can be indexed.
        result = new FlattenGraphFilter(result);
        return new TokenStreamComponents(src, result);
    }

The getSynonyms method builds a synonym map from a comma-delimited text file
and I know this is working because all the one-word synonym replacements
index and search perfectly. The problem I have is with synonym phrases.
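The thread does not show getSynonyms, so the following is only a sketch of one step such a loader might perform: turning each comma-delimited line into a group of entries, where every entry is kept as a sequence of tokens rather than as one string containing a space. The class and method names are hypothetical. As far as Lucene itself is concerned, SynonymMap.Builder expects the words of a multi-word entry to be joined with its own word separator (the static SynonymMap.Builder.join helper does this), so a loader that passes "iso week" through as a single string may be worth checking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene and not the poster's code.
final class SynonymFileSketch {

    /** Parses one line such as "isoweek,iso week" into [[isoweek], [iso, week]]. */
    static List<List<String>> parseLine(String line) {
        List<List<String>> group = new ArrayList<>();
        for (String entry : line.split(",")) {
            String trimmed = entry.trim().toLowerCase(Locale.ROOT);
            if (trimmed.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            // Keep a multi-word entry as separate tokens, not one string.
            group.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return group;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("isoweek,iso week"));
    }
}
```

Each inner list could then be handed to whatever builds the real synonym map, one token array per entry.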

So if the synonyms input file contains

isoweek,isodate

then (using Luke) I can see that any document containing either 'isoweek' or
'isodate' has indexed both terms, and a search with either term returns
matching results for both. Great.

However if the input file contains

isoweek,iso week

then (again using Luke) I can see that while any document containing
'isoweek' has indexed the terms 'isoweek', 'iso' and 'week', unfortunately
any document containing 'iso week' has only indexed 'iso' and 'week'.
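To make the expected behaviour concrete, here is a toy model in plain Java of symmetric expansion over token sequences. It is not how Lucene implements SynonymGraphFilter and it ignores token positions; it only shows which terms one would hope to see indexed when a phrase entry is matched against consecutive tokens. All names are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: real analysis also tracks positions and offsets.
final class PhraseSynonymSketch {

    /**
     * Returns the set of terms that would be indexed for the given tokens
     * if every synonym group were expanded in both directions.
     * Each group is a list of entries; each entry is a list of tokens.
     */
    static Set<String> indexedTerms(List<String> tokens, List<List<List<String>>> groups) {
        Set<String> terms = new TreeSet<>(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            for (List<List<String>> group : groups) {
                for (List<String> entry : group) {
                    int end = i + entry.size();
                    boolean matches = !entry.isEmpty()
                            && end <= tokens.size()
                            && tokens.subList(i, end).equals(entry);
                    if (matches) {
                        // One entry matched, so add the tokens of every entry in the group.
                        for (List<String> other : group) {
                            terms.addAll(other);
                        }
                    }
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<List<List<String>>> groups =
                List.of(List.of(List.of("isoweek"), List.of("iso", "week")));
        System.out.println(indexedTerms(List.of("iso", "week"), groups));
        System.out.println(indexedTerms(List.of("isoweek"), groups));
    }
}
```

In this model both inputs yield the same three terms, which is the symmetry the single-word case already shows and the phrase case is missing.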

Am I chasing the impossible here? Is there something I can do in the query
analyzer to make it work? (Currently the query analyzer is the same as the
indexing analyzer with the SynonymGraphFilter and FlattenGraphFilter
omitted.) Or do I have to manually pre-process any query to include OR
options for all phrase synonyms?
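For reference, the manual pre-processing option could look roughly like the sketch below. It assumes the query has already been split into lowercase tokens and that the synonym groups are available as lists of token lists; the class and method names are hypothetical. The output uses classic query-parser syntax, where a quoted string is parsed as a phrase.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical query pre-processor; it only builds a query string.
final class QueryExpansionSketch {

    /** Renders one synonym entry as query text, quoting multi-word phrases. */
    static String render(List<String> entry) {
        String text = String.join(" ", entry);
        return entry.size() > 1 ? "\"" + text + "\"" : text;
    }

    /** Replaces each matched synonym entry with an OR group of all its alternatives. */
    static String expand(List<String> tokens, List<List<List<String>>> groups) {
        List<String> clauses = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            List<List<String>> matchedGroup = null;
            int matchedLength = 0;
            // Prefer the longest entry that matches at this position.
            for (List<List<String>> group : groups) {
                for (List<String> entry : group) {
                    int end = i + entry.size();
                    if (entry.size() > matchedLength && end <= tokens.size()
                            && tokens.subList(i, end).equals(entry)) {
                        matchedGroup = group;
                        matchedLength = entry.size();
                    }
                }
            }
            if (matchedGroup == null) {
                clauses.add(tokens.get(i)); // no synonym here, keep the token as-is
                i++;
            } else {
                List<String> alternatives = new ArrayList<>();
                for (List<String> entry : matchedGroup) {
                    alternatives.add(render(entry));
                }
                clauses.add("(" + String.join(" OR ", alternatives) + ")");
                i += matchedLength; // skip the tokens consumed by the match
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        List<List<List<String>>> groups =
                List.of(List.of(List.of("isoweek"), List.of("iso", "week")));
        System.out.println(expand(List.of("iso", "week", "format"), groups));
    }
}
```

With the single group isoweek / iso week, the tokens iso, week, format become `(isoweek OR "iso week") format`. How the remaining clauses combine depends on the default operator of whatever parser receives the string.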

I haven't produced a small test case for this because I'm hoping a high
level discussion is all I need to put me on the right track.

cheers

T

Re: synonym question

bernd.fehling at uni-bielefeld

Mar 14, 2022, 8:15 AM

Post #2 of 4 (428 views)

Hello,

Just a guess: have you tried escaping the space in your multi-word terms
with a backslash?

isoweek,iso\ week

Regards
Bernd

On 14.03.22 at 15:54, Trevor Nicholls wrote:
> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: synonym question

trevor at castingthevoid

Mar 14, 2022, 9:02 AM

Post #3 of 4 (428 views)

Hi, thanks for such a quick response!

No, I hadn't thought of that. In how many of the following would I need to do this:
- synonym map creation
- analyzing text for indexing
- analyzing text for querying

If either of the latter two then I can see lots of complications ensuing; it more or less makes a synonym map redundant if I have to manually parse the text and identify all the potential synonyms in advance. I may be missing something critical, of course.

cheers
T

RE: synonym question

trevor at castingthevoid

Mar 14, 2022, 11:11 AM

Post #4 of 4 (428 views)

Just to confirm, escaping the spaces in synonym table construction, query construction, or both, does not solve the problem.
