Mailing List Archive: Question about PhraseQuery's capacity...

Question about PhraseQuery's capacity...

ctengctsh at gmail

Jan 10, 2020, 12:24 AM

Post #1 of 7 (1098 views)

I use SmartChineseAnalyzer for indexing and add a document with a
TextField whose value is a long sentence; when analyzed, it yields 18 terms.

I then use the same value to construct a PhraseQuery, setting slop to 2
and adding the 18 terms consecutively.

I expect the search API to find this document, but it returns nothing.

What am I doing wrong?

Re: Question about PhraseQuery's capacity... [ In reply to ]

jpountz at gmail

Jan 10, 2020, 12:52 AM

Post #2 of 7 (1098 views)

It should match. My guess is that you might not be reusing the same positions
as set by the analysis chain when creating the phrase query. Can you show
us how you build the phrase query?

--
Adrien
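
To make "the same positions as set by the analysis chain" concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class name PositionSketch and the sample increments are made up for illustration. In Lucene, each token carries a position increment (exposed through PositionIncrementAttribute.getPositionIncrement()), and filters that drop tokens, such as stop-word removal, typically report the gap as an increment greater than 1. Whether the analyzer in this thread actually drops tokens for the sentence in question is an assumption.

```java
/** Toy model of token positions; hypothetical helper, not part of Lucene. */
final class PositionSketch {

    /**
     * Turns per-token position increments into absolute positions.
     * Counting starts at -1, so a first increment of 1 yields position 0.
     */
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int current = -1;
        for (int i = 0; i < increments.length; i++) {
            current += increments[i];   // an increment of 2 leaves a one-slot gap
            positions[i] = current;
        }
        return positions;
    }
}
```

With increments 1, 1, 2, 1, 2 the five surviving tokens sit at positions 0, 1, 3, 4, 6 rather than 0 to 4. A phrase query that adds the same five terms without explicit positions places them at 0 to 4, which is the kind of mismatch the reply above is asking about. PhraseQuery.Builder.add(Term, int) accepts an explicit position for this purpose.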

Re: Question about PhraseQuery's capacity... [ In reply to ]

ctengctsh at gmail

Jan 10, 2020, 1:09 AM

Post #3 of 7 (1098 views)

Hi Adrien,
I think I may have made a mistake:
There are two levels of processing in an Analyzer class: one is the
Tokenizer, which here is HMMChineseTokenizer, and the other is the Analyzer
itself, which may apply some filtering.
I'm using Lucene's default interface to set an Analyzer instance for
indexing, but I'm using the Tokenizer to parse the raw query text when
building the Query.
The weird thing is that there is a Lucene query-parser module, but it
handles meta syntax such as AND/OR and fieldName:xxx, so I think it cannot
deal directly with raw query text?
But when I try to use Analyzer.tokenStream() to split the raw query text
into separate terms, I find the API very confusing: TokenStream has no clear
interface for getting the terms (the filtered tokens), only the Attribute
concept, which seems to be used only in Lucene internals. Where can I find
sample code that extracts the filtered tokens from a TokenStream?

Re: Question about PhraseQuery's capacity... [ In reply to ]

ctengctsh at gmail

Jan 10, 2020, 2:13 AM

Post #4 of 7 (1098 views)

After calling Analyzer.tokenStream() directly to extract the terms from the
query text, I still get no results, and I don't know why.

Code that builds the index:

    // analyzer is a new SmartChineseAnalyzer()
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);

Code that runs the query:

(1) Extract the terms from the query text:

    public List<String> analysis(String fieldName, String text) {
        List<String> terms = new ArrayList<String>();
        // try-with-resources closes the stream once it has been consumed
        try (TokenStream stream = analyzer.tokenStream(fieldName, text)) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // only the term text is kept; position increments are not read
                terms.add(termAtt.toString());
            }
            stream.end();
        } catch (IOException e) {
            log.error(e.getMessage(), e);
        }
        return terms;
    }

(2) Code to construct a PhraseQuery:

    private Query buildPhraseQuery(Analyzer analyzer, String fieldName,
                                   String queryText, int slop) {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        builder.setSlop(slop); // called with slop = 2 in this thread; is 2 the max?
        // analyzer.analysis(...) is the helper shown in (1)
        List<String> terms = analyzer.analysis(fieldName, queryText);
        for (String termKeyword : terms) {
            // add(Term) places each term at the next consecutive position
            builder.add(new Term(fieldName, termKeyword));
        }
        return builder.build();
    }

Using a BooleanQuery also failed:

    private Query buildBooleanANDQuery(Analyzer analyzer, String fieldName,
                                       String queryText) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        List<String> terms = analyzer.analysis(fieldName, queryText);
        log.info("terms: " + StringUtils.join(terms, ", "));
        for (String termKeyword : terms) {
            // every term must be present, in any order and at any position
            Term term = new Term(fieldName, termKeyword);
            builder.add(new TermQuery(term), BooleanClause.Occur.MUST);
        }
        return builder.build();
    }

Re: Question about PhraseQuery's capacity... [ In reply to ]

Jan 10, 2020, 2:21 AM

Post #5 of 7 (1097 views)

Hello,
Sometimes IndexSearcher.explain(Query, int) helps to analyse mismatches.

--
Sincerely yours
Mikhail Khludnev

Re: Question about PhraseQuery's capacity... [ In reply to ]

ctengctsh at gmail

Jan 10, 2020, 3:14 AM

Post #6 of 7 (1097 views)

The explain API helps! Thanks for the hint!
I found that one case failed because I had carelessly added another
filter condition, but the other case (which is analyzed into 30 terms)
still fails, and I don't know why.
I guess I need to write a unit test that uses the MultiTerms.getTerms API to
find out whether there is a mismatch in the analyzer's processing or a
capacity limit in PhraseQuery.
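
The comparison described above can be sketched independently of Lucene. This is a minimal illustration under stated assumptions: TermDiffSketch and missingFromIndex are hypothetical names, and the two inputs stand for the terms read back from the index for one field and the terms produced by analyzing the query text, however they were obtained.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for spotting analysis mismatches; not part of Lucene. */
final class TermDiffSketch {

    /** Returns the query terms that do not occur among the indexed terms, in query order. */
    static List<String> missingFromIndex(Collection<String> indexedTerms,
                                         List<String> queryTerms) {
        Set<String> indexed = new HashSet<>(indexedTerms);
        List<String> missing = new ArrayList<>();
        for (String term : queryTerms) {
            if (!indexed.contains(term)) {
                missing.add(term);
            }
        }
        return missing;
    }
}
```

An empty result means every query term exists in the index, so any remaining mismatch would have to come from something other than the term text, such as term positions.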

Re: Question about PhraseQuery's capacity... [ In reply to ]

ctengctsh at gmail

Jan 12, 2020, 11:59 PM

Post #7 of 7 (1075 views)

Hi, I have filed an issue against lucene-core:
https://issues.apache.org/jira/browse/LUCENE-9130
I wrote a test case and found that a BooleanQuery with MUST clauses works,
but the PhraseQuery fails.
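
One way an all-MUST BooleanQuery can match while a PhraseQuery over the same terms does not is a difference in term positions. The sketch below is a simplified plain-Java model, not Lucene's implementation and not a diagnosis of the linked issue: it assumes the query terms appear in the document in the same order without repeats, and it treats the slop needed as the spread of (document position minus query position) across the terms. All names and sample positions are hypothetical.

```java
import java.util.List;

/** Simplified model of AND matching versus in-order sloppy phrase matching. */
final class PhraseVsAndSketch {

    /** AND semantics: every query term occurs somewhere in the document. */
    static boolean matchesAll(List<String> docTerms, List<String> queryTerms) {
        return docTerms.containsAll(queryTerms);
    }

    /**
     * Smallest slop that lets the phrase match in this model.
     * docPositions[i] and queryPositions[i] refer to the same term.
     */
    static int requiredSlop(int[] docPositions, int[] queryPositions) {
        if (docPositions.length != queryPositions.length) {
            throw new IllegalArgumentException("position arrays must have equal length");
        }
        if (docPositions.length == 0) {
            return 0;
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < docPositions.length; i++) {
            int shift = docPositions[i] - queryPositions[i];
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }
}
```

For example, if six terms were indexed at positions 0, 1, 3, 4, 6, 8 (three tokens dropped in between) and the query places them at 0 to 5, the model needs a slop of 3, so a slop of 2 would not match even though every term is present. Reusing the indexed positions in the query brings the required slop back to 0.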
