Hi all,
Miles Barr and Erik Hatcher posted on my website, and since i wanted to get
in contact with you sooner or later anyway, i'm doing it now :).
Originally i wanted to wait until i had some better-quality code, but
well...
(So if you read something like "it's working" or "i ported", please don't
take this too literally ;))
So let me introduce myself first: i'm Max Nickel and i'm working on a
project i called Rise, which tries to be a Ruby implementation of Lucene.
I just read some of the recent posts on this mailing list, and it seems
that you are concentrating your efforts on getting it done with SWIG, so
i don't know if what i did will be of much use to you.
I took a different approach and first tried a pure Ruby implementation.
This was rise-0.1.1, which you can also get from rise.rubyforge.org or from
my outdated Arch repo. At that stage everything was still very buggy and
nowhere near what you could call working, but i had enough working to see
that pure Ruby is simply unacceptably slow (i expected this to happen
anyway).
So at this point i decided to port some of the more performance-critical
parts to C. I know that this might not be the best
approach when you care about portability or deployment, but i felt that
it was necessary if you want to do anything other than index your
address book.
Right now i have ported the following classes, either completely or in part
as mixins:
FS/RAM-IO, Tokenizers up to LowerCaseTokenizer, Term, TermBuffer, Token,
QuickSort, HeapSort, TermInfosWriter#add and #write,
DocumentWriter#writePostings and #addPosition, and SegmentTermEnum +
some helper classes.
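
To give an idea of what one of the simpler classes in that list does, here is a minimal pure Ruby sketch of a lowercasing letter tokenizer. It assumes the behaviour of Lucene's LowerCaseTokenizer (split on non-letters, lowercase each token); the names TokenSketch and LowerCaseTokenizerSketch are made up for illustration and are not Rise's actual classes.

```ruby
# Hypothetical names; this is an illustration, not code from Rise.
TokenSketch = Struct.new(:text, :start_offset, :end_offset)

class LowerCaseTokenizerSketch
  LETTERS = /[[:alpha:]]+/

  def initialize(text)
    @text = text
  end

  # Yields one token per run of letters, lowercased, with the character
  # offsets of the run in the original text.
  def each
    return enum_for(:each) unless block_given?

    @text.scan(LETTERS) do |word|
      match = Regexp.last_match
      yield TokenSketch.new(word.downcase, match.begin(0), match.end(0))
    end
  end
end

tokens = LowerCaseTokenizerSketch.new("Hello, World").each.to_a
tokens.map(&:text) # => ["hello", "world"]
```

A loop like this runs once per token over every indexed file, which is why it is a plausible candidate for moving to C.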
The C implementations don't use any headers other than ruby.h and
rubyio.h (sys/stdlib.h is needed only once, in fsio.c), so everywhere
Ruby compiles, Rise should compile too.
Also, nearly all classes except the IO ones aren't pure C, but make use
of Ruby's C functions like rb_ivar_*, rb_funcall etc.
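
As a rough picture of what "not pure C" means here: a C method that goes through rb_ivar_get and rb_funcall does much the same work as the Ruby calls instance_variable_get and send. The sketch below shows that correspondence in plain Ruby; the module and class names are hypothetical and the method is only a stand-in, not a real Rise method.

```ruby
# Hypothetical stand-in for a method that would be implemented in C
# and mixed into a Ruby class.
module TermBufferFastSketch
  def text_length
    # C side would be roughly: rb_ivar_get(self, rb_intern("@text"))
    text = instance_variable_get(:@text)
    # C side would be roughly: rb_funcall(text, rb_intern("length"), 0)
    text.send(:length)
  end
end

class TermBufferSketch
  include TermBufferFastSketch

  def initialize(text)
    @text = text
  end
end

TermBufferSketch.new("lucene").text_length # => 6
```

Since every such call still goes through Ruby's instance variable lookup and method dispatch, this style is easier to port than pure C but cannot be expected to be as fast.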
As i wrote in an email to Miles Barr earlier, here are some very rough
indexing stats:
/usr/src/linux of a recent 2.6.12 kernel takes on my machine:
  with Lucene                   ~4 minutes
  with Rise in pure Ruby        > 60 minutes
  with my current Rise/C impl   ~20 minutes
The current state is unfortunately broken, since somewhere in my recent
changes i made some stupid mistake and keep getting "Docs out of
order" exceptions when merging segments. I haven't had much time on my
hands lately to hunt this bug, but i hope it will be the last major one
before the 0.1.2 release (except that the searching side is broken, as it
hasn't been updated to the changes i made yet).
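
For readers who don't know this error: in a Lucene-style index the document numbers of a term's postings are stored as gaps to the previous document number, so the writer has to see them in increasing order. A minimal sketch of such a check, assuming strictly increasing document numbers (delta_encode_docs is a hypothetical helper, not taken from Rise or Lucene):

```ruby
# Hypothetical helper for illustration only.
# Turns a list of document numbers into gaps between neighbours and
# refuses input that is not strictly increasing.
def delta_encode_docs(doc_numbers)
  last_doc = nil
  doc_numbers.map do |doc|
    if last_doc && doc <= last_doc
      raise ArgumentError, "Docs out of order (#{doc} <= #{last_doc})"
    end
    delta = last_doc ? doc - last_doc : doc
    last_doc = doc
    delta
  end
end

delta_encode_docs([3, 7, 8]) # => [3, 4, 1]
```

When segments are merged, the document numbers of later segments normally get shifted by the document count of the earlier ones, so a missing or wrong shift is one possible way to trip a check like this.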
Since i was tired of GNU Arch's UI and switched to monotone, you also can't
get my current sources. But once i manage to set up my local server i'll
let you know.
kind regards,
/max