Sorry in advance for writing a small novel.
Background: I am indexing and searching technical reference documents, so
the standard language analyzers aren't appropriate. For example, the content
needs to be indexed so that a search for total matches total value,
total[value], and total(value), but a search for total[ only matches the
second of these.
As a first step I wrote a custom analyzer which uses
PatternCaptureGroupTokenFilter to split the token stream into word character
sequences (total, value) and non-word single characters (, ), [, ].
class TechAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new LowerCaseFilter(src);
        // word-character runs (total, value) and single non-word characters
        Pattern alphanum = Pattern.compile("(\\w+)");
        Pattern nonalpha = Pattern.compile("(\\W)");
        result = new PatternCaptureGroupTokenFilter(result, false, alphanum, nonalpha);
        return new TokenStreamComponents(src, result);
    }
}
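As an aside, a small harness like the following shows not just which tokens come out but also their position increments (a sketch; the field name "text" is arbitrary). A token emitted with an increment of 0 sits at the same position as the previous token, and QueryParser treats stacked tokens as synonyms, which becomes relevant further down:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class TokenDump {
    // Return each emitted token together with its position increment.
    // An increment of 0 means the token is stacked at the same position
    // as the previous one.
    static List<String> dump(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posInc =
                    ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term + " +" + posInc.getPositionIncrement());
            }
            ts.end();
        }
        return out;
    }
}
```

If the bracket tokens come out of TechAnalyzer with increment 0, that would go a long way towards explaining the query behaviour described below.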
I have tested this analyzer on diverse input files to verify that:
"total value" produces 2 tokens: total, value
"total(value)" produces 4 tokens: total, (, value, )
"total[value]" also produces 4 tokens: total, [, value, ]
So this is the analyzer used to build the index:
..
TechAnalyzer analyzer = new TechAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
..
and it evidently works: using Luke to inspect the terms in the index, I can
see that ( ) [ ] total and value are all present as separate terms.
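For completeness, the elided indexing code is just the stock pattern, roughly like this (a sketch with an in-memory directory; the real code writes to disk):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class IndexDemo {
    // Index a handful of strings into an in-memory directory and
    // return how many documents ended up in the index.
    static int indexAll(IndexWriterConfig iwc, String... texts) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
            for (String text : texts) {
                Document doc = new Document();
                // TextField runs the configured analyzer and records positions
                doc.add(new TextField("text", text, Store.YES));
                writer.addDocument(doc);
            }
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.numDocs();
        }
    }
}
```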
So as far as I can tell, the indexing is happening as per requirement. Now
for searching, which is where it is going wrong.
If I want to search for the single word total everything is fine. The
problem is if I want to search for "total[".
String queryStr = "total[";
Query q = new QueryParser("text",new TechAnalyzer()).parse(queryStr);
..
This matches far too many documents because the query is being treated as a
synonym which matches either total or [. To confirm this, if I output
q.toString() I see "Synonym([ total)".
If instead I modify the input query so that it searches for a phrase ("total
[") then it appears to be looking for consecutive terms (q.toString() is
"total [") and it comes back with four results. All four matched documents
do indeed have "total[" in them; the trouble is that there are sixteen other
documents that should match as well, and it is not obvious to me why they
aren't being selected.
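For what it's worth, the same phrase can also be built programmatically, bypassing QueryParser's analysis step entirely (a sketch; with the default slop of 0, PhraseQuery requires the two terms at consecutive positions, so the result still depends on the positions recorded at index time):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

public class PhraseDemo {
    // Build the two-term phrase query: "total" immediately followed by "[",
    // without going through QueryParser at all.
    static Query totalBracket(String field) {
        return new PhraseQuery.Builder()
                .add(new Term(field, "total"))
                .add(new Term(field, "["))
                .build();
    }
}
```

This produces the same kind of query the phrase syntax does, but makes the intended terms explicit.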
Using Luke again to find the two tokens total and [ in the documents, I see
the following for the first match:
For "total"
Position Offsets Payload
19
56
56
For "["
Position Offsets Payload
20
56
57
The actual string "total[" is in the document twice, if I inspect it myself.
For a document which Lucene does not match, but which it should, I can see
in Luke
"total"
Position Offsets Payload
78
80
"["
Position Offsets Payload
78
80
Again, if I inspect this document by hand, it contains "total[" twice.
I don't know if the empty offsets and payloads indicate a problem, and I
don't know if the duplicated positions in the first example are a problem
either (although that document is selected correctly!)
What I do know is that there must be something wrong somewhere because
despite sending the correct token stream to the indexer, querying the data
is performing worse than dumping the documents as text and grepping them.
Any pointers gratefully received.
cheers
T