Heh,
Cool :-) In principle we might be able to, but it will be a while,
as our legal and biz dev will be involved.
However, I do believe everything I did was referred to by Dave as
some point. Most of the changes are pretty obvious if you run through
the code.
I'm about to do a bunch of benchmarking (maybe 2 weeks?) on Linux
and Solaris, of Texis and Lucene, in 4 different configurations
(weighted and unweighted, sloppy phrase match and conjunctive). I'll
post a summary :)
A lot about optimizing Lucene involves taming GC with RAMdirectory.
I would say that using RAMdirectory is a huge saving.
Minimize fields -- have one indexed, tokenized, not stored, one with
the "content" as a monolithic field (parse it afterwards). Write a
custom Hit Collector if appropriate. Minimize classes, stick with
Java builtins as much as possible.
There are other considerations choosing between texis and lucene --
cost(!!) and caching (as I said). Memory maxes out at 4GB on most
normal boxes, so if you can't fit your document base and index in
<4GB, then you need the caching.
Winton
>Hello,
>
>Funny, I was just wondering how Lucene compares to Texis the other day.
>Yes, I guess Lucene doesn't have any caching. Perhaps this could
>easily be added by making use of one of many caching projects that seem
>to be popping up under Jakarta (jakarta.apache.org).
>
>Winston, if appropriate, could you share some of the changes you made
>to Lucene to support the query rate that you mentioned?
>
>Thanks,
>Otis
>
>
>--- Winton Davies <wdavies@cs.stanford.edu> wrote:
>> Hi,
>>
>> We're (Overture/Goto) evaluating Lucene ... email me specific
>> questions.
>>
>> In general I would say Lucene is very efficient. It is only about
>> 30% slower than Thunderstone Texis
>> (which is a native C code base). Main difference is that Lucene
>> doesn't handle Caching as well as
>> Texis does.
>>
>> Basically the Index is on Disk or in RAM (ie can take up 400-500 MB
>>
>> in our application). Texis for example
>> is able to buffer what it can of the Index in memory without
>> explicit setting of memory limits.
>>
>> Out of the box we couldn't use Phrase Matching for very high volume
>>
>> transactions (we're looking at 1000s queries/sec)
>> and had to customize it to your needs, but because its Open Source,
>>
>> guess what, you can write any kind
>> of optimizations you want. Actually that isn't fair -- just be
>> careful that you understand the performance
>> parameters involved in text retrieval and the various types of
>> querys that are possible. Do you need Text Retrieval
>> or Are you doing an unranked "Text Search" ?
>>
>>
>> Oh, and its free :)
>>
>> Reliable ? Well I've never had a problem someone couldnt answer,
>> and
>> it never crashes (ie its pretty bug-free
>> as far as I can tell).
>o:lucene-user-help@jakarta.apache.org>
>>
>
>
>__________________________________________________
>Do You Yahoo!?
>Send FREE video emails in Yahoo! Mail!
>http://promo.yahoo.com/videomail/
>
>--
>To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
>For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
--
Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/ --
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>