Mailing List Archive

Bewildered by my search results, can anyone explain where I might be going wrong?
Sorry in advance for writing a small novel.



Background: I am indexing and searching technical reference documents, so
the standard language analyzers aren't appropriate. For example, the content
needs to be indexed so that a search for total matches total value,
total[value], and total(value), but a search for total[ matches only the
second of these.



As a first step I wrote a custom analyzer which uses
PatternCaptureGroupTokenFilter to split the token stream into sequences of
word characters (total, value) and single non-word characters such as
(, ), [ and ].



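    // Package names as in recent Lucene releases; adjust for your version.
    import java.util.regex.Pattern;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.pattern.PatternCaptureGroupTokenFilter;

    class TechAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldname) {
            // Split on whitespace first, then lower-case each token.
            WhitespaceTokenizer src = new WhitespaceTokenizer();
            TokenStream result = new LowerCaseFilter(src);
            // One or more word characters, e.g. total, value.
            Pattern alphanum = Pattern.compile("(\\w+)");
            // Any single non-word character, e.g. ( ) [ ].
            Pattern nonalpha = Pattern.compile("(\\W)");
            // false = do not also emit the original, unsplit token.
            result = new PatternCaptureGroupTokenFilter(result, false, alphanum, nonalpha);
            return new TokenStreamComponents(src, result);
        }
    }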



I have tested this analyzer on diverse input files to verify that:

"total value" produces 2 tokens: total, value

"total(value)" produces 4 tokens: total, (, value, )

"total[value]" also produces 4 tokens: total, [, value, ]
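
The splitting described above can be modelled in plain Java, without Lucene, as a rough sketch. The class and method names (TokenSplitSketch, split) are hypothetical, and the single combined regex is a simplification of the two patterns passed to the filter; it only reproduces the term text, not positions or offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java model of the intended splitting; it does not use Lucene.
class TokenSplitSketch {

    // Either a run of word characters or one single non-word character.
    private static final Pattern PIECE = Pattern.compile("\\w+|\\W");

    // Splits on whitespace, lower-cases each chunk, then cuts it into pieces.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            Matcher m = PIECE.matcher(chunk.toLowerCase(Locale.ROOT));
            while (m.find()) {
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("total value"));
        System.out.println(split("total(value)"));
        System.out.println(split("total[value]"));
    }
}
```

This checks only which terms come out, which is the same thing the three test cases above check. It says nothing about the position each term is given, which matters once phrase queries are involved.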



So this is the analyzer used to build the index:

    ..
    TechAnalyzer analyzer = new TechAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    ..

and it does seem to be applied: I can use Luke to inspect the terms in the
index and see that ( ) [ ] total and value are all present as separate terms.



So as far as I can tell, the indexing is happening as per requirement. Now
for searching, which is where it is going wrong.

If I want to search for the single word total everything is fine. The
problem is if I want to search for "total[".



String queryStr = "total[";

Query q = new QueryParser("text",new TechAnalyzer()).parse(queryStr);

..



This matches far too many documents because the query is being treated as a
synonym which matches either total or [. To confirm this, if I output
q.toString() I see "Synonym([ total)".
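
One possible reading of the "Synonym([ total)" output, offered as an assumption rather than something confirmed in this thread, is that both terms reach the query parser at the same position, the way stacked synonyms do. The plain-Java model below (no Lucene calls; StackedPositionSketch, Tok and analyze are hypothetical names) assumes that every piece cut from one whitespace-delimited chunk keeps that chunk's position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java model of token positions; it does not call Lucene.
class StackedPositionSketch {

    // A term together with the position it would be indexed or queried at.
    static final class Tok {
        final String term;
        final int position;

        Tok(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + "@" + position;
        }
    }

    private static final Pattern PIECE = Pattern.compile("\\w+|\\W");

    // ASSUMPTION: only a new whitespace-delimited chunk advances the position;
    // all pieces cut from the same chunk share that chunk's position.
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        int position = -1;
        for (String chunk : text.trim().split("\\s+")) {
            position++;
            Matcher m = PIECE.matcher(chunk.toLowerCase(Locale.ROOT));
            while (m.find()) {
                out.add(new Tok(m.group(), position));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("total["));  // both terms share one position
        System.out.println(analyze("total [")); // terms at consecutive positions
    }
}
```

Under that assumption, total[ yields two terms at one position, while total [ (with a space) yields two consecutive positions. That would be consistent with the synonym-style query for the first form and the phrase-style query for the second.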



If instead I modify the input query so that it searches for a phrase ("total
[") then it appears to be looking for consecutive terms (q.toString() is
"total [") and it comes back with four results. All four matched documents
do indeed have "total[" in them; the trouble is that there are sixteen other
documents that should match as well, and it is not obvious to me why they
aren't being selected.



Using Luke again to find the two tokens total and [ in the documents, I see
the following for the first match:

For "total"

    Position  Offsets  Payload
    19
    56
    56

For "["

    Position  Offsets  Payload
    20
    56
    57



The actual string "total[" is in the document twice, if I inspect it myself.



For a document which Lucene does not match, but which it should, I can see
in Luke

For "total"

    Position  Offsets  Payload
    78
    80

For "["

    Position  Offsets  Payload
    78
    80



Again, if I inspect this document by hand, it contains "total[" twice.



I don't know if the empty offsets and payloads indicate a problem, and I
don't know if the duplicated positions in the first example are a problem
either (although that document is selected correctly!)

What I do know is that there must be something wrong somewhere because
despite sending the correct token stream to the indexer, querying the data
is performing worse than dumping the documents as text and grepping them.
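
The position listings from Luke can be checked against a simple adjacency rule. The sketch below is plain Java, not Lucene code, and assumes that an exact two-term phrase needs the second term exactly one position after the first. PhrasePositionSketch and adjacent are hypothetical names; the position values are copied from the Luke listings above.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java check of term positions; it does not call Lucene.
class PhrasePositionSketch {

    // True if some occurrence of the second term sits exactly one position
    // after some occurrence of the first term (the assumed phrase condition).
    static boolean adjacent(List<Integer> first, List<Integer> second) {
        for (int p : first) {
            if (second.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Document that the phrase query matched.
        List<Integer> totalMatched = Arrays.asList(19, 56, 56);
        List<Integer> bracketMatched = Arrays.asList(20, 56, 57);
        System.out.println(adjacent(totalMatched, bracketMatched));

        // Document that should match but was not returned.
        List<Integer> totalMissed = Arrays.asList(78, 80);
        List<Integer> bracketMissed = Arrays.asList(78, 80);
        System.out.println(adjacent(totalMissed, bracketMissed));
    }
}
```

Under this rule the first document qualifies (19 is followed by 20, and 56 by 57), while in the second document both terms sit at identical positions, so no occurrence of [ is one position after total.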



Any pointers gratefully received.



cheers

T
RE: Bewildered by my search results, can anyone explain where I might be going wrong?
Unfortunately it looks like my mailer has decided to monkey with the
patterns, sorry about that. They should read:
Pattern alphanum = Pattern.compile( a pattern that matches one or more
'word' characters );
Pattern nonalpha = Pattern.compile( a pattern that matches any single
'non-word' character );

I forgot to include in my already too long message that I haven't been able
to add my custom analyzer to Luke to test the search side; I can add my jar
file and Luke says "custom analyzer built" but it doesn't offer my analyzer
as an option for use in parsing the query string.

cheers
T

-----Original Message-----
From: Trevor Nicholls <trevor@castingthevoid.com>
Sent: Tuesday, 22 June 2021 08:10
To: java-user@lucene.apache.org
Subject: Bewildered by my search results, can anyone explain where I might
be going wrong?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org