Mailing List Archive: Ferret: A native Ruby port of Apache Lucene

Oct 23, 2005, 10:55 PM

Post #1 of 3 (9750 views)

Hi All,

Brian McCallister suggested I announce Ferret here but it appears, from the
archives, that Erik Hatcher beat me to it. Anyway, in case you missed it,
the url is;

http://ferret.davebalmain.com/trac/

Just to clarify a few things, the reason I chose the name Ferret and not
RubyLucene is because 1) rubylucene was already taken ;-) and 2) I didn't
want to feel tied to the Apache Lucene API. If there is a better way to do
something in Ruby, I'd like to do it that way. Having said that, I've mostly
stuck to the Apache Lucene API. And I intend to continue supporting the
Apache Lucene index format. Hopefully some of the ideas in Ferret will one
day be adopted back into the Apache Lucene project.

As for what I've ported so far? Almost everything. All query types including
span queries are in there.

Where Ferret falls short at the moment is;

* Robustness: It's still alpha and some areas are better tested than others.
In particular, I really need to test threading.
* Unicode Support: At the index level, this isn't a problem, but as far as
analysis and search go, it needs a little work.
* Performance: Well, it is Ruby. However, I've written the indexer in C and
it makes the Java version seem painfully slow. So don't expect Ferret to
remain slower than Apache Lucene forever.

So, if anyone wants to help out, please download it, play with it and give
me your feedback. It's available as a gem and there is a quick tutorial
here;

http://ferret.davebalmain.com/api/files/TUTORIAL.html

Happy Ferreting.
Dave

Re: Ferret: A native Ruby port of Apache Lucene

erik at ehatchersolutions

Oct 24, 2005, 2:44 AM

Post #2 of 3 (9438 views)

Dave - thanks for Ferret, and for sharing it with us here. I've not
had a chance to try it personally yet, but will very soon.

On 24 Oct 2005, at 01:55, David Balmain wrote:
> Just to clarify a few things, the reason I chose the name Ferret and not
> RubyLucene is because 1) rubylucene was already taken ;-)

Well, it wasn't really taken exactly. I started rucene and renamed
it to rubylucene at rubyforge a long while ago and never did anything
with it beyond some very very rudimentary low-level I/O work. My
main hesitation, besides the zillion other time commitments, was my
concern about duplicating effort given how PyLucene works with GCJ
and SWIG and can come up to speed with Java Lucene almost automatically.

> and 2) I didn't want to feel tied to the Apache Lucene API. If there is a
> better way to do something in Ruby, I'd like to do it that way. Having said
> that, I've mostly stuck to the Apache Lucene API. And I intend to continue
> supporting the Apache Lucene index format. Hopefully some of the ideas in
> Ferret will one day be adopted back into the Apache Lucene project.

As long as it is compatible with the index format I think it's fair
to use "lucene" in the name. You're welcome to "rubylucene" if you
like :)

> As for what I've ported so far? Almost everything. All query types including
> span queries are in there.

Impressive!

Lucene 1.4.3 compatibility? Or TRUNK?

> * Performance: Well, it is Ruby. However, I've written the indexer in C and
> it makes the Java version seem painfully slow. So don't expect Ferret to
> remain slower than Apache Lucene forever.

Do you think there is anything about the Java implementation that
could be improved in this regard so that the difference is not so
dramatic? What is the C code optimizing that the Java is not?
Surely we could bring the Java implementation close to C level speed
in terms of I/O, no?

While indexing speed is certainly very important, in many (most?)
projects the searching speed is the main concern and indexing speed
is of much less concern.

> So, if anyone wants to help out, please download it, play with it and give
> me your feedback. It's available as a gem and there is a quick tutorial
> here;
>
> http://ferret.davebalmain.com/api/files/TUTORIAL.html

You can count on it... I'll be there a little later today!

Erik

Re: Ferret: A native Ruby port of Apache Lucene

dbalmain.ml at gmail

Oct 24, 2005, 4:12 AM

Post #3 of 3 (9400 views)

On 10/24/05, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>
> Do you think there is anything about the Java implementation that
> could be improved in this regard so that the difference is not so
> dramatic? What is the C code optimizing that the Java is not?
> Surely we could bring the Java implementation close to C level speed
> in terms of I/O, no?

Just a quick answer because I want to tackle your OS X problem. Basically I
think this is just something that C really excels at. I haven't profiled
Lucene yet but I'm guessing a lot of the time is taken up in the read and
write byte methods, simply because these methods are called so many times.
I could be very wrong about this, as I'm no expert on the JVM, but I think
that while Java is able to optimize the implementation of things like
sorting and hashing, the sheer number of simple instructions is just going
to be a lot quicker in C.
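
As an illustration of why per-byte calls add up (this is not taken from
either codebase): the Lucene index format stores many integers in a
variable-length encoding, seven bits per byte with the high bit marking
continuation, so writing one number can mean several one-byte write calls.
A minimal Java sketch follows. ByteWriteSketch and its method are
hypothetical names, and ByteArrayOutputStream stands in for Lucene's own
output class.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: counts how many single-byte writes one integer costs.
class ByteWriteSketch {

    /**
     * Writes a non-negative int in a variable-length format: the low seven
     * bits go out first, and the high bit of each byte is set while more
     * bytes follow. Returns the number of one-byte write calls made.
     */
    static int writeVInt(ByteArrayOutputStream out, int value) {
        int calls = 0;
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // seven data bits plus continuation bit
            value >>>= 7;
            calls++;
        }
        out.write(value); // final byte, continuation bit clear
        return calls + 1;
    }
}
```

Under this encoding 127 takes one call and 128 takes two, so an index with
millions of such values makes millions of tiny method calls, which is the
kind of cost being guessed at above.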

So that wasn't very helpful to you, but one area that could be improved is
memory management. Obviously I had to keep a pretty close eye on it in C and
it was definitely the most difficult part of porting to C. Anyway, I found a
few places where Lucene could be a little more frugal with the memory. For
example, the TermEnum creates a new Term object for each term as you skip
through. One could save a lot of memory and time by doing the comparisons
against the TermBuffer object, instead of creating a new object. This is
just an example (and it's also something I need to fix in Ferret).
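
To sketch the TermBuffer idea in code: the first method below allocates a
new term object at every step of the scan, the second refills one reusable
buffer and compares against it in place. This is only a simplified
illustration. The Term and TermBuffer classes here are hypothetical
stand-ins, not the real Lucene classes, and the index is faked with a
sorted array of field/text pairs.

```java
// Hypothetical sketch of skipping through sorted terms with and without
// allocating an object per step.
class TermScanSketch {

    /** Immutable term; a simplified stand-in for Lucene's Term. */
    static final class Term {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        int compareTo(Term other) {
            int c = field.compareTo(other.field);
            return c != 0 ? c : text.compareTo(other.text);
        }
    }

    /** Reusable mutable holder; a simplified stand-in for TermBuffer. */
    static final class TermBuffer {
        String field;
        String text;

        void set(String field, String text) {
            this.field = field;
            this.text = text;
        }

        int compareTo(Term target) {
            int c = field.compareTo(target.field);
            return c != 0 ? c : text.compareTo(target.text);
        }
    }

    /** Index of the first entry >= target, creating one Term per step; -1 if none. */
    static int skipToAllocating(String[][] entries, Term target) {
        for (int i = 0; i < entries.length; i++) {
            Term current = new Term(entries[i][0], entries[i][1]); // allocation per term
            if (current.compareTo(target) >= 0) {
                return i;
            }
        }
        return -1;
    }

    /** Same result, but one buffer is refilled and compared in place. */
    static int skipToBuffered(String[][] entries, Term target) {
        TermBuffer buffer = new TermBuffer(); // single allocation for the whole scan
        for (int i = 0; i < entries.length; i++) {
            buffer.set(entries[i][0], entries[i][1]);
            if (buffer.compareTo(target) >= 0) {
                return i;
            }
        }
        return -1;
    }
}
```

Both methods return the same position for the same input; the only
difference is how many objects are created along the way.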

> While indexing speed is certainly very important, in many (most?)
> projects the searching speed is the main concern and indexing speed
> is of much less concern.

Definitely agreed. But a lot of the search speed is also influenced by how
fast the indexer can read the indexes. So by speeding up the indexing
module, you'll get a lot of gains in search speed too. I wouldn't want to
rewrite the search functionality in C as that is the part of the library
that people will want to extend. (Same goes for Analysis.) And it is in a
lot more flux than the indexer.