Mailing List Archive

HighLighting Service
Hi all,
I am working on the highlight-terms functionality for Lucene.
I followed Maik Schreiber's suggestions (http://www.iq-computing.de/lucene/highlight.htm) step by step and implemented them with some changes:
In the white paper the highlighting was based on the summary field only. My version reads the whole document (from a cache) with the SelfBufferedStream method into a string that is passed to the HighLight method.

Some problems show up here:

1. It doesn't work with all query types, e.g. WildcardQuery, FuzzyQuery, PrefixQuery, PhraseQuery.

3. The response time is not constant. If the documents to highlight are big files, around 2 to 4 MB, the average response time per query is:
- 20 sec for 10 docs of 2 MB each
whereas for small files it is:
- 0.6 sec for 10 docs of 20 KB each

What can we do? Any suggestions?

Some tips:
1. The document must be plain text to get a good result. This means there are two options: first, build a text version of the document at runtime (with big documents this will be another handicap for response time); second, keep a cache of the plain-text versions of all documents.

2. The highlighting process produces a highlighted version of the entire document, while it would be good to have just one, two or three portions.
In that case we gain something, because we can cut the iteration short when we are done, saving time and resources.

3. I think we should incorporate this feature into Lucene. Right now, to make this work you have to change some code in the Lucene package, so staying up to date requires changing those parts of the code every time (if they are still there!!). Also, it strictly depends on the Lucene core package.

I attach my version of LuceneTools.java and the code I wrote that is used by the servlet:
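
Tip 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: it works on plain text, matches the query terms as literal substrings instead of going through the analyzer, and uses java.util.regex. The class SnippetExtractor and the parameters maxSnippets and radius are hypothetical names, not part of Lucene or LuceneTools.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Lucene or LuceneTools. */
final class SnippetExtractor {

    /**
     * Returns at most maxSnippets excerpts, each reaching about radius
     * characters to either side of a hit, and stops scanning the text as
     * soon as enough excerpts have been collected.
     */
    static List<String> snippets(String text, String[] terms, int maxSnippets, int radius) {
        List<String> result = new ArrayList<String>();
        if (terms.length == 0) {
            return result;
        }
        // One case-insensitive alternation over all terms, each quoted literally.
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(term));
        }
        Matcher m = Pattern.compile(alternation.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE).matcher(text);
        int from = 0;
        // The loop ends after maxSnippets hits, so the rest of a large
        // document is never scanned.
        while (result.size() < maxSnippets && from < text.length() && m.find(from)) {
            int start = Math.max(from, m.start() - radius);
            int end = Math.min(text.length(), m.end() + radius);
            result.add(text.substring(start, end));
            from = end; // continue after this excerpt so excerpts never overlap
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "aaa Lucene bbb ccc lucene ddd LUCENE eee";
        for (String s : snippets(text, new String[] {"lucene"}, 2, 4)) {
            System.out.println("..." + s + "...");
        }
    }
}
```

The excerpts would then be passed to the highlighter instead of the whole document, so the cost depends on where the first few hits are and not on the total file size.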
...
String brief = "";
String url = doc.get("url"); // path to the cached plain-text version of the document to highlight
StringBuffer sb = new StringBuffer();
StringBuffer sblower = new StringBuffer();
FileInputStream fis = new FileInputStream(url);
byte[] b = new byte[1024];
int effective = -1;
while ((effective = fis.read(b)) != -1)
{
    // convert only the bytes actually read; new String(b) would also append stale buffer contents
    String s = new String(b, 0, effective);
    sb.append(s);
    sblower.append(s.toLowerCase());
}
fis.close();
try
{
    brief = LuceneTools.highlightTerms(sb.toString(), sblower.toString(), highLighter, query, analyzer);
}
catch (Exception e)
{
    e.printStackTrace();
}
out.println(searchUI.getSearchItem(score, doctitle, url, "..." + brief + "..."));

....

I hope someone can give me some tips so that I can complete this functionality.
Thanks, bye.




Re: HighLighting Service
On Tue, 2002-04-09 at 20:22, none none wrote:

> I am working on the highlight-terms functionality for Lucene.
>
> Some problems show up here:
>
> 1. It doesn't work with all query types, e.g. WildcardQuery, FuzzyQuery, PrefixQuery, PhraseQuery.

One thing I did was to modify LuceneTools (well, I rewrote it
eventually) to output regular expressions instead of just terms. Then
use gnu.regexp or Jakarta ORO to match expressions against various forms
of the original documents. This allows you to do custom highlighting
(i.e. highlight entire phrases, not just the tokens in those phrases). It
also allows you to do wildcard matching with better speed if you
generate a single expression for the wildcard query, rather than
matching against every single term the wildcard query would match
individually. I didn't address FuzzyQuery or date queries.
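
A rough sketch of the regular-expression idea, assuming java.util.regex in place of gnu.regexp or Jakarta ORO and HTML bold tags as the highlight markup. The class RegexHighlighter and its method names are hypothetical; this is not the rewritten LuceneTools described above, only an illustration of generating one expression per wildcard term or phrase.

```java
import java.util.regex.Pattern;

/** Hypothetical helper illustrating regex-based highlighting. */
final class RegexHighlighter {

    /** Turns a wildcard term such as "high*" or "te?t" into one regex for a whole word. */
    static String wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder("\\b");
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                regex.append("\\w*");   // any run of word characters
            } else if (c == '?') {
                regex.append("\\w");    // exactly one word character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return regex.append("\\b").toString();
    }

    /** Turns a phrase into one regex that accepts any whitespace run between its words. */
    static String phraseToRegex(String phrase) {
        String[] words = phrase.trim().split("\\s+");
        StringBuilder regex = new StringBuilder("\\b");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s+");
            }
            regex.append(Pattern.quote(words[i]));
        }
        return regex.append("\\b").toString();
    }

    /** Wraps every match of the regex in bold tags; $0 is the whole match. */
    static String highlight(String text, String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(text)
                .replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("Highlighting is a high priority", wildcardToRegex("high*")));
        System.out.println(highlight("Lucene does full text search", phraseToRegex("full text")));
    }
}
```

With this shape a wildcard query costs one pass over the text with a single expression, and a phrase is highlighted as one unit even when it is split across a line break.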

> What can we do? Any suggestions?

A method of generating document context is to store the body of your
document in the index. Then retrieve it, normalize any whitespace,
abbreviate the text at the first hit, and highlight the relevant terms
in the abbreviated text. This doesn't sound all that quick, but it
proved to be much quicker than consulting the original document in some
non-numerical tests I did.
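
The steps in the paragraph above (retrieve, normalize whitespace, abbreviate at the first hit, highlight) might look something like this minimal sketch. It assumes the stored text has already been fetched from the index as a String, for example with doc.get on a stored field; the class ContextExtract and the parameters lead and length are hypothetical, and a single literal term stands in for the real query.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that builds a short highlighted extract from stored text. */
final class ContextExtract {

    /**
     * Collapses whitespace, cuts a window of at most length characters that
     * starts lead characters before the first hit, and bolds the term in it.
     */
    static String extract(String storedBody, String term, int lead, int length) {
        // 1. Normalize whitespace.
        String text = storedBody.replaceAll("\\s+", " ").trim();

        // 2. Abbreviate the text at the first hit (or at the start if there is none).
        Pattern p = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher first = p.matcher(text);
        int start = first.find() ? Math.max(0, first.start() - lead) : 0;
        int end = Math.min(text.length(), start + length);
        String window = text.substring(start, end);

        // 3. Highlight only inside the abbreviated text.
        String highlighted = p.matcher(window).replaceAll("<b>$0</b>");
        return (start > 0 ? "..." : "") + highlighted + (end < text.length() ? "..." : "");
    }

    public static void main(String[] args) {
        String body = "Lucene   is a\nfull-text search   library written in Java";
        System.out.println(extract(body, "search", 5, 19));
    }
}
```

Because the highlighting runs only on the short window, its cost stays small however long the stored body is; the whitespace normalization is still a full pass over the text.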

That works really well for context extracts. However, it may or may not
be applicable to highlighting the entire document - it would depend on
the original format of your documents I think. I still consult the
original (HTML) documents for doing that, but all my documents are
fairly short.

> 3. I think we should incorporate this feature into Lucene. Right now, to make
> this work you have to change some code in the Lucene package, so staying up
> to date requires changing those parts of the code every time (if they are
> still there!!). Also, it strictly depends on the Lucene core package.

There are a whole bunch of different ways of implementing highlighting;
not all of them require changes to Lucene's core. I think integrating a
full highlight retrieval system into Lucene that's sufficiently generic
to match with Lucene's architecture might be difficult at best...

> I hope someone can give me some tips so that I can complete this functionality.

I'm not 100% sure what you need to do further?

For what it's worth, if your current code is sufficient, I'd go with
that. I've refactored a few highlighting systems, and most of them end
up with quite a lot of code, depending on how detailed your spec is.

Regards,

--
Lee Mallabone.



--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
Re: HighLighting Service
Thank you very much.
How can I add the entire content of the page to the index?
I know I have to change the .jj file, but I don't know how to do that.
Does this option slow down the query? If so, I'll just load from a cache.
I think I'll use Jakarta ORO. If you can send me an example I'd appreciate it; I have never used Jakarta ORO or gnu.regexp.
Thanks again,
bye.


Re: HighLighting Service
Hi Lee,


Would you like to add your code to the contributions?

Thanks

--Peter


On 4/10/02 1:46 AM, "Lee Mallabone" <lee@grantadesign.com> wrote:

> One thing I did was to modify LuceneTools (well, I rewrote it
> eventually) to output regular expressions instead of just terms. Then
> use gnu.regexp or Jakarta ORO to match expressions against various forms
> of the original documents. This allows you to do custom highlighting
> (i.e. highlight entire phrases, not just the tokens in those phrases). It
> also allows you to do wildcard matching with better speed if you
> generate a single expression for the wildcard query, rather than
> matching against every single term the wildcard query would match
> individually. I didn't address FuzzyQuery or date queries.


Re: HighLighting Service
On Wed, 2002-04-10 at 15:57, Peter Carlson wrote:
>
> Would you like to add your code to the contributions?

Hi Peter,

Much as I'd like to, there are a couple of problems preventing me from
doing so immediately:

I make certain assumptions about the "cleanliness" of the query input
because I use a very restrictive version of the query parser. I'm not
certain these assumptions would hold in the general case.

The other problem is that all the code I've written has been on company
time, so would require the copyright to be released before I could
contribute. I'd need to talk to my boss about this.

I don't know how much of what I've written would be useful to the Lucene
community - I have a bunch of code that handles indexing, context
generation and highlighting of HTML documents. However, the code can
also make certain assumptions because I can guarantee that my HTML
documents will not, for example, contain masses of malformed tags and
script. Is it still worth trying to contribute a solution that's not
going to be useful to *everyone* wanting HTML searching/highlighting?

Regards,

--
Lee Mallabone.



Re: HighLighting Service
On Wed, 2002-04-10 at 15:50, none none wrote:

> How can I add the entire content of the page to the index?
> I know I have to change the .jj file, but I don't know how to do that.

I don't speak JavaCC very well so can't really help there. I found
JavaCC ended up hindering rather than helping me, but that could just be
my parser ignorance.

> Does this option slow down the query? If so, I'll just load from a cache.

I don't know if storing fields as well as indexing them slows down a
query. I wouldn't have thought it slows the query down very much at all,
but someone who's worked on Lucene's internals more would be much better
qualified to answer that.

> I think I'll use Jakarta ORO. If you can send me an example I'd appreciate it; I have never used Jakarta ORO or gnu.regexp.

gnu.regexp is very simple to use - its site seems to be down from here
at the moment though.

Regards,

--
Lee Mallabone.



RE: HighLighting Service
FYI - The latest edition of Java Developers Journal has an article on using
JavaCC.

> -----Original Message-----
> From: Lee Mallabone [mailto:lee@grantadesign.com]
> Sent: Wednesday, April 10, 2002 10:39 AM
> To: korfut@lycos.com
> Cc: lucene-dev@jakarta.apache.org
> Subject: Re: HighLighting Service
>
>
> On Wed, 2002-04-10 at 15:50, none none wrote:
>
> > How can I add the entire content of the page to the index?
> > I know I have to change the .jj file, but I don't know how to do that.
>
> I don't speak JavaCC very well so can't really help there. I found
> JavaCC ended up hindering rather than helping me, but that could just be
> my parser ignorance.