Apologies for this being a bit long, and for explaining the problem
at the very bottom after all the background rather than starting with it at
the top. I thought it was easier to explain this way, so please bear with me!
So I've indexed a library of technical documentation, and the index has
stored several fields per document: category, volume, title, text, etc.
Title and text are tokenised and stored; all other fields are just indexed.
When searching the index I am using the standard QueryParser, and a typical
query might look like
"(title:graph AND title:axis) OR (text:graph AND text:axis)"
Because indexing includes synonym matching, I need the search to identify
matched terms in the content, e.g. in the above "graph" and "chart" are
synonyms, and "axis" and "axes" are as well.
So my search method executes the query to get a set of matching documents,
and uses the highlighter methods to identify the matches in the content:
private void doSearch( IndexReader reader, IndexSearcher searcher, Query query,
        int max, FileWriter writer, FileWriter matchlist ) throws IOException {
    // hlPre="\001"; hlPost="\002";
    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter( hlPre, hlPost );
    Highlighter highlighter = new Highlighter( htmlFormatter,
            new QueryScorer( query ));
    TopDocs results = searcher.search( query, max );
    ScoreDoc[] hits = results.scoreDocs;
    int numTotalHits = Math.toIntExact( results.totalHits.value );
    HashSet<String> matchedWords = new HashSet<String>();
    int start = 0;
    int end = Math.min( numTotalHits, max );
    for ( int i = start; i < end; i++ ) {
        Document doc = searcher.doc( hits[i].doc );
        String text = doc.get( "text" );
        try {
            // re-analyse the stored text so the highlighter can mark matches
            TokenStream tokens = TokenSources.getTokenStream( "text", null,
                    text, analyzer, -1 );
            TextFragment[] frag = highlighter.getBestTextFragments( tokens,
                    text, true, 100 );
            for ( int j = 0; j < frag.length; j++ ) {
                if (( frag[j] != null ) && ( frag[j].getScore() > 0 )) {
                    addMatchedTerms( matchedWords, frag[j].toString() );
                }
            }
        } catch ( Exception e ) {
            // error handling not shown
        }
        writer.write( doc.get("id") + "\n" );
    }
    for ( String word : matchedWords ) {
        matchlist.write( word + "\n" );
    }
}
There's more of course but that's the guts of it; I haven't shown the
analyzer or the method which extracts the delimited words from the fragment
and adds them to the matchedWords hashset.
In the simple example shown this works fine, and the matched words include
graph and axis and any other synonyms found in the selected documents.
The problem occurs when I use the query to filter the search by category or
by volume. I'm doing this by adding extra conditions to the query, e.g.
"(category:note AND volume:extra) AND ((title:graph AND title:axis) OR
(text:graph AND text:axis))"
When we do this the search correctly returns only documents in the selected
category/volume, but unfortunately the highlighter.getBestTextFragments()
method marks all the occurrences of "note" and "extra" in the content too.
This we don't want.
I can't see how to separate that part of the query out in the highlighter
methods, and I wonder what best practice would be here. I'm probably being
naive in using a single query for the whole job. Do I need to run a query
for category/volume, and then a subquery on text and title, and just use the
subquery in the highlighter? If that's the approach, is there a nice simple
explanation somewhere you could point me to? I'm a simple user who
has never done anything beyond using the simple QueryParser for everything.
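To make the question concrete, here is a rough sketch of the split I have in
mind, in plain Java with no Lucene calls. The class and method names
(SplitQuery, filterPart, contentPart, searchQuery) are made up for
illustration and are not part of any library. The idea would be to parse the
combined string for searcher.search(), but to parse only the content string
and hand that query to the QueryScorer, on the assumption that the
highlighter only marks terms from the query it is given.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps the restriction clauses apart from the content clauses.
public class SplitQuery {

    // Restriction clauses (category, volume): meant for searching only.
    public static String filterPart(String category, String volume) {
        List<String> clauses = new ArrayList<>();
        if (category != null) clauses.add("category:" + category);
        if (volume != null) clauses.add("volume:" + volume);
        return String.join(" AND ", clauses);
    }

    // Content clauses: all terms in the title, or all terms in the text.
    public static String contentPart(String... terms) {
        List<String> groups = new ArrayList<>();
        for (String field : new String[] { "title", "text" }) {
            List<String> clauses = new ArrayList<>();
            for (String term : terms) clauses.add(field + ":" + term);
            groups.add("(" + String.join(" AND ", clauses) + ")");
        }
        return String.join(" OR ", groups);
    }

    // Combined query string for the search itself; falls back to the
    // content query alone when no restriction is given.
    public static String searchQuery(String filter, String content) {
        return filter.isEmpty() ? content : "(" + filter + ") AND (" + content + ")";
    }

    public static void main(String[] args) {
        String content = contentPart("graph", "axis");
        String combined = searchQuery(filterPart("note", "extra"), content);
        System.out.println("search with:    " + combined);
        System.out.println("highlight with: " + content);
    }
}
```

With this, doSearch would take two Query objects instead of one: the
combined one for searcher.search() and the content-only one for the
Highlighter. Whether that is the recommended way to do it is exactly what I
am unsure about.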
cheers
T