Hi all,
I am working on term highlighting for Lucene.
I followed Maik Schreiber's suggestion (http://www.iq-computing.de/lucene/highlight.htm) step by step and implemented it with some changes:
The white paper bases the highlighting (HL) only on the summary field. My version reads the whole document (from a cache) with the SelfBufferedStream method into a string, which is then passed to the highlight method.
Some problems show up here:
1. It doesn't work with all query types, e.g. WildcardQuery, FuzzyQuery, PrefixQuery, PhraseQuery.
2. The response time is not constant. If the documents to highlight are big files, around 2 to 4 MB, the average response time per query is:
- 20 sec for 10 docs of 2 MB each
whereas for small files it is:
- 0.6 sec for 10 docs of 20 KB each
What can we do? Any suggestions?
Some tips:
1. The document must be plain text to get a good result. This means there are two options: either build a text version of the document at runtime (for big documents this is another handicap for response time), or keep a cache with a plain-text version of every document.
2. The HL process produces a highlighted version of the entire document, while it would be better to produce just one portion, or two or three.
In that case we gain something, because we can cut the iteration short as soon as we have enough portions, saving time and resources.
3. I think we should incorporate this feature into Lucene. Right now, to make it work you have to change some code in the Lucene package, so staying up to date means re-applying those changes to every new version (if those parts of the code are still there!). It also belongs there because it depends strictly on the Lucene core package.
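To make tip 2 more concrete, here is a minimal sketch of the "stop when done" idea. It is not the code from LuceneTools.java and it does not use the Lucene analyzer or query at all: it assumes a single literal term, plain case-insensitive string matching, and <b> tags as the highlight markup. The names FragmentSketch, firstFragments, indexOfIgnoreCase, maxFragments and context are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or of LuceneTools.java.
class FragmentSketch {

    // Case-insensitive search without building a lower-cased copy of the text.
    static int indexOfIgnoreCase(String text, String term, int from) {
        for (int i = from; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns at most maxFragments snippets, each with up to `context`
    // characters on either side of a hit, and stops scanning as soon as
    // enough snippets have been collected.
    static List<String> firstFragments(String text, String term,
                                       int maxFragments, int context) {
        List<String> fragments = new ArrayList<>();
        if (term.isEmpty()) {
            return fragments;
        }
        int from = 0;
        while (fragments.size() < maxFragments) {
            int hit = indexOfIgnoreCase(text, term, from);
            if (hit < 0) {
                break; // no more matches in the rest of the document
            }
            int hitEnd = hit + term.length();
            int start = Math.max(0, hit - context);
            int end = Math.min(text.length(), hitEnd + context);
            fragments.add(text.substring(start, hit)
                    + "<b>" + text.substring(hit, hitEnd) + "</b>"
                    + text.substring(hitEnd, end));
            from = end; // continue after this snippet so snippets do not overlap
        }
        return fragments;
    }

    public static void main(String[] args) {
        String text = "Lucene makes search easy; Lucene is fast; lucene again";
        for (String fragment : firstFragments(text, "lucene", 2, 5)) {
            System.out.println("..." + fragment + "...");
        }
    }
}
```

With a limit of two or three fragments, the scan of a 2 MB document ends as soon as the last fragment is found, so only documents whose matches sit near the end pay the full cost. A real version would have to walk the analyzer's tokens instead of calling regionMatches, so that it agrees with how the query terms were indexed.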
I attach my version of LuceneTools.java and the code I wrote for the servlet:
...
String brief = "";
String url = doc.get("url"); // path of the cached plain-text version of the document to highlight
StringBuffer sb = new StringBuffer();
StringBuffer sblower = new StringBuffer();
FileInputStream fis = new FileInputStream(url);
try
{
    byte[] b = new byte[1024];
    int effective;
    while ((effective = fis.read(b)) != -1)
    {
        // decode only the bytes actually read; new String(b) would also append stale bytes left over from the previous read
        String s = new String(b, 0, effective);
        sb.append(s);
        sblower.append(s.toLowerCase());
    }
}
finally
{
    fis.close();
}
try
{
    brief = LuceneTools.highlightTerms(sb.toString(), sblower.toString(), highLighter, query, analyzer);
}
catch (Exception e)
{
    e.printStackTrace();
}
out.println(searchUI.getSearchItem(score, doctitle, url, "..." + brief + "..."));
....
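One more note on the read loop: converting each 1024-byte chunk to a String on its own can split a multi-byte character at a chunk boundary if the cached files are not in a single-byte encoding. A possible alternative, sketched here under the assumption that the cache is stored in a known charset (UTF-8 in the usage line below), is to let a Reader do the decoding. CachedTextReader and readAll are hypothetical names, not part of Lucene or of LuceneTools.java.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper for reading the cached plain-text document.
class CachedTextReader {

    // Reads the whole stream into a String, decoding with the given charset.
    // The Reader keeps incomplete byte sequences between reads, so characters
    // are never cut in half at a buffer boundary.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n); // append only the chars actually read
            }
        }
        return sb.toString();
    }
}
```

In the servlet this would be called as something like CachedTextReader.readAll(new FileInputStream(url), StandardCharsets.UTF_8), and the lower-cased copy could then be produced once from the result instead of chunk by chunk.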
I hope someone can give me some tips so that I can complete this functionality.
Thanks, bye.