Mailing List Archive

A change of direction
So I haven't been putting a whole lot of work into Lucene4c lately,
mainly as a result of picking up and moving across the country for a new
job. That's pretty much taken care of at this point, and I'm finally
settled enough that I can pick up this project again.

Well, actually, I've been at least sort of settled for the past few
weeks, but every time I sat down and tried to get started with the part
of Lucene4c I had been hacking on before I moved I just never got
anywhere. So I put it down and stopped worrying about it for a while.

Then, in the past few days I got motivated again, although it was in a
slightly different way than I expected. I noticed that the Linux
distribution I had on my laptop actually happened to have a recent
snapshot of GCJ on it, and since it'd been a while since I really played
with GCJ I decided to give it a shot. Lucene4c was the obvious place to
start playing.

So far, I've got a reasonable start on a version of Lucene4c that sits
on top of a GCJ compiled Lucene, using its CNI interfaces to call from a
thin layer of C++ code into Java. Right now I've got it creating
documents and fields, and later tonight I expect to be indexing
documents. Hopefully searching will come reasonably soon after that.

I think this is the shortest path towards actually having something that
people can make use of, which will hopefully lead to this becoming a
project with more people than just me hacking on it ;-)

Does this mean I'm getting rid of the C implementation? Well, yes and
no. I'm not planning on hacking on it in the near future, but there is
a large chunk of useful code there, and I'm not just going to delete it
or something. When writing the interface to the GCJ backed code I'm
going to make a concerted effort to make it possible to plug a C
implementation in as the backend without too much trouble, so perhaps in
the future someone will be able to revisit the code and make it useful
again.

Anyway, if anyone wants to take a look at what I've got, just speak up
and I'll commit it on a branch in the repository. I'm planning on doing
that eventually anyway (probably once it hits "minimum level of
functionality", which at this point means "basic indexing and searching
is working"), but if other people want to see what I've got I'm happy to
speed up the process.

-garrett
Re: A change of direction
Garrett Rooney wrote:
> So I haven't been putting a whole lot of work into Lucene4c lately,
> mainly as a result of picking up and moving across the country for a new
> job. That's pretty much taken care of at this point, and I'm finally
> settled enough that I can pick up this project again.
>
> Well, actually, I've been at least sort of settled for the past few
> weeks, but every time I sat down and tried to get started with the part
> of Lucene4c I had been hacking on before I moved I just never got
> anywhere. So I put it down and stopped worrying about it for a while.
>
> Then, in the past few days I got motivated again, although it was in a
> slightly different way than I expected. I noticed that the Linux
> distribution I had on my laptop actually happened to have a recent
> snapshot of GCJ on it, and since it'd been a while since I really played
> with GCJ I decided to give it a shot. Lucene4c was the obvious place to
> start playing.
>
> So far, I've got a reasonable start on a version of Lucene4c that sits
> on top of a GCJ compiled Lucene, using its CNI interfaces to call from a
> thin layer of C++ code into Java. Right now I've got it creating
> documents and fields, and later tonight I expect to be indexing
> documents. Hopefully searching will come reasonably soon after that.
>

I assume the API will be similar/same as the Java Lucene?

For mod_mbox (or mod_Lucene?), it will need to drop to the C level
instead of C++. The two places to do a wrapper are inside each project,
or at the library API level. Doing it in each project will be easier in
the short run, since each project can wrap just the API functionality it
needs and easily return things to C. Writing a C API to wrap a complete
C++ API will be harder.

> I think this is the shortest path towards actually having something that
> people can make use of, which will hopefully lead to this becoming a
> project with more people than just me hacking on it ;-)
>
> Does this mean I'm getting rid of the C implementation? Well, yes and
> no. I'm not planning on hacking on it in the near future, but there is
> a large chunk of useful code there, and I'm not just going to delete it
> or something. When writing the interface to the GCJ backed code I'm
> going to make a concerted effort to make it possible to plug a C
> implementation in as the backend without too much trouble, so perhaps in
> the future someone will be able to revisit the code and make it useful
> again.

What I liked most about 'Lucene4c' was avoiding the dependency on
Java. Not that Java is inherently evil, just that it creates a rather
large dependency chain. For a C/C++ application that just wishes to use
a Lucene search, depending on a native C library can be much easier than
depending on a GCJ-based library. I should, however, withhold judgment:
truthfully, I haven't done much with GCJ, and my experiences with it are
all relatively old.

I can't argue on a technical basis that this new approach is wrong; I
know that it makes the amount of code to be written much smaller.

It would be interesting to try to combine the two, using the GCJ back
end where no one has written the code yet, and native C when possible.
I suspect the native C would never be completed however, since the GCJ
back end would be 'good enough' for most people.

> Anyway, if anyone wants to take a look at what I've got, just speak up
> and I'll commit it on a branch in the repository. I'm planning on doing
> that eventually anyway (probably once it hits "minimum level of
> functionality", which at this point means "basic indexing and searching
> is working"), but if other people want to see what I've got I'm happy to
> speed up the process.


I would like to see it when you get a chance. I want to integrate it
with mod_mbox, but honestly, no rush. I doubt I will have time for
mod_mbox hacking for about a week...

-Paul
Re: A change of direction
Paul Querna wrote:

> I assume the API will be similar/same as the Java Lucene?
>
> For mod_mbox (or mod_Lucene?), it will need to drop to the C level
> instead of C++. The two places to do a wrapper are inside each project,
> or at the library API level. Doing it in each project will be easier in
> the short run, since each project can wrap just the API functionality it
> needs and easily return things to C. Writing a C API to wrap a complete
> C++ API will be harder.

Actually, the API looks quite a bit like the current Lucene4c API.
Things are similar to the Java API in that the names of the key objects
are the same (since the Lucene4c objects are just thin wrappers around
the Java ones), but I'm trying to keep the feel of the API as similar to
an APR/C library as possible.

While the code that actually interfaces with the Java Lucene is of
necessity written in GCJ's C++/CNI dialect, the actual interface exposed
by the Lucene4c header files is pure C. Java level exceptions are
converted into lcn_error_t objects, C level UTF-8 strings are converted
into java.lang.String objects, all the objects are hidden behind opaque
structs, access happens via functions that are declared extern "C", etc.
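
To make the shape of that interface concrete, here is a minimal sketch of
the "opaque struct plus error object" style in plain C. Every demo_* name
is hypothetical and invented for illustration; this is not the Lucene4c
header, and the real library would forward to the CNI glue rather than
store anything itself. The only assumption borrowed from the thread is
that a NULL return means success and that an error carries a message.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef __cplusplus
extern "C" {   /* lets a C++ implementation export plain C symbols */
#endif

/* Public side: what a header might expose (hypothetical names). */
typedef struct demo_error_t { const char *message; } demo_error_t;
typedef struct demo_document_t demo_document_t;   /* opaque to callers */

demo_error_t *demo_document_create(demo_document_t **doc);
demo_error_t *demo_document_add_field(demo_document_t *doc,
                                      const char *name, const char *value);
size_t demo_document_field_count(const demo_document_t *doc);
void demo_document_destroy(demo_document_t *doc);

#ifdef __cplusplus
}
#endif

/* Private side: the layout lives only in the implementation file. */
#define DEMO_MAX_FIELDS 8

struct demo_document_t {
  size_t nfields;
  const char *names[DEMO_MAX_FIELDS];
  const char *values[DEMO_MAX_FIELDS];
};

static demo_error_t demo_err_nomem = { "out of memory" };
static demo_error_t demo_err_badarg = { "name and value must not be NULL" };
static demo_error_t demo_err_full = { "too many fields" };

/* Returns NULL on success, or a pointer to an error object on failure. */
demo_error_t *demo_document_create(demo_document_t **doc)
{
  assert(doc != NULL);
  *doc = calloc(1, sizeof **doc);
  return *doc ? NULL : &demo_err_nomem;
}

demo_error_t *demo_document_add_field(demo_document_t *doc,
                                      const char *name, const char *value)
{
  assert(doc != NULL);
  if (name == NULL || value == NULL)
    return &demo_err_badarg;
  if (doc->nfields == DEMO_MAX_FIELDS)
    return &demo_err_full;
  doc->names[doc->nfields] = name;     /* pointers are borrowed, not copied */
  doc->values[doc->nfields] = value;
  doc->nfields++;
  return NULL;
}

size_t demo_document_field_count(const demo_document_t *doc)
{
  assert(doc != NULL);
  return doc->nfields;
}

void demo_document_destroy(demo_document_t *doc)
{
  free(doc);
}
```

A caller would only ever hold a demo_document_t pointer and check each
returned error against NULL, which is why the implementation behind the
header can be swapped without touching calling code.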

You'll need to link against libgcj, but that's the only externally
visible indication that you're calling into compiled Java code underneath.

> What I liked most about 'Lucene4c' was avoiding the dependency on
> Java. Not that Java is inherently evil, just that it creates a rather
> large dependency chain. For a C/C++ application that just wishes to use
> a Lucene search, depending on a native C library can be much easier than
> depending on a GCJ-based library. I should, however, withhold judgment:
> truthfully, I haven't done much with GCJ, and my experiences with it are
> all relatively old.

Oh, I agree, it would be awfully nice to have a library fully
implemented in C. But without other people actually diving in to help
with the implementation, I suspect that by the time I got a full C
implementation to the level of usefulness we'd need, GCJ would be so
commonly installed on systems that the advantage of having it in C
would be gone ;-)

Plus, I used to have similar opinions on GCJ, but with the versions that
are out there now (I'm using a snapshot from GCC 4.0) it really seems to
"just work" for virtually everything I've tried to do.

> I can't argue on a technical basis that this new approach is wrong; I
> know that it makes the amount of code to be written much smaller.

Dramatically. Progress is quite a bit faster than it ever was doing the
straight C reimplementation, and that's with me doing the C->C++->Java
glue code by hand. A lot of it is boilerplate (mainly the code that
does the exception->lcn_error_t conversion), and I suspect I can write a
script to generate it automatically.

> It would be interesting to try to combine the two, using the GCJ back
> end where no one has written the code yet, and native C when possible.
> I suspect the native C would never be completed however, since the GCJ
> back end would be 'good enough' for most people.

Yeah, I'd thought about that, and came to the same conclusion as you.

> I would like to see it when you get a chance. I want to integrate it
> with mod_mbox, but honestly, no rush. I doubt I will have time for
> mod_mbox hacking for about a week...

Well, just so you get an idea what I'm talking about, here's the current
'main.c' I'm testing with. The only /really/ new things compared to the
old Lucene4c interface are that now I'm working with UTF-8 strings,
there's an index writer object, and you have to call some initialization
functions before you do anything to make the GCJ runtime happy. This
code actually does work; I just got it to add its first document and was
able to search for it via Java Lucene.

-garrett

#include <apr.h>
#include <apr_general.h>   /* apr_initialize, apr_terminate */

#include <stdio.h>
#include <stdlib.h>        /* atexit */

#include "lcn_init.h"
#include "lcn_document.h"
#include "lcn_pools.h"
#include "lcn_index_writer.h"

int
main (int argc, char *argv[])
{
  /* Bring up APR and make sure it is torn down at exit. */
  apr_initialize ();

  atexit (apr_terminate);

  /* Initialization calls that have to happen before anything else,
     to keep the GCJ runtime happy. */
  lcn_init ();

  lcn_thread_attach ();

  lcn_document_t *doc;
  lcn_field_t *field;
  lcn_error_t *err;

  apr_pool_t *pool = lcn_pool_create (NULL);

  /* Each call returns NULL on success or an lcn_error_t on failure.
     This test program only reports errors; it does not stop on them. */
  err = lcn_document_create (&doc, pool);
  if (err)
    fprintf (stderr, "error creating document :(\n");

  /* A stored, indexed but untokenized field named "path". */
  err = lcn_field_create (&field,
                          "path",
                          "foo",
                          LCN_STORED_YES,
                          LCN_INDEXED_UNTOKENIZED,
                          LCN_TERMVECTOR_NO,
                          pool);
  if (err)
    fprintf (stderr, "error creating field: %s\n", err->message);

  err = lcn_document_add_field (doc, field);
  if (err)
    fprintf (stderr, "error adding field to document :(\n");

  lcn_analyzer_t *analyzer;

  err = lcn_analyzer_standard_create (&analyzer, pool);
  if (err)
    fprintf (stderr, "error creating analyzer: %s\n", err->message);

  lcn_index_writer_t *writer;

  err = lcn_index_writer_create (&writer, "index", analyzer, pool);
  if (err)
    fprintf (stderr, "error creating index: %s\n", err->message);

  /* Add the document to the index through the writer. */
  err = lcn_index_writer_add_document (writer, doc, pool);
  if (err)
    fprintf (stderr, "error adding document to index: %s\n", err->message);

  lcn_pool_destroy (pool);

  lcn_thread_detach ();

  return 0;
}
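
The test program above repeats the same "call, check err, print" pattern
for every step and keeps going after a failure. One possible way to fold
that into a helper is sketched below. This is only an illustration under
stated assumptions: demo_error_t is a stand-in that mirrors just the
`message` member used in main.c, demo_step simulates an API call, and
DEMO_TRY and demo_index_one are hypothetical names, not part of Lucene4c.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for lcn_error_t; only a `message` member is assumed. */
typedef struct demo_error_t { const char *message; } demo_error_t;

static demo_error_t demo_err = { "simulated failure" };

/* Simulated API call: returns NULL on success, an error otherwise. */
static demo_error_t *demo_step(int ok)
{
  return ok ? NULL : &demo_err;
}

/* Evaluate expr once; on error, report it, record the failure and jump
   to the cleanup label. Expects `status` and `cleanup` in the caller. */
#define DEMO_TRY(expr, what)                                          \
  do {                                                                \
    demo_error_t *demo_e_ = (expr);                                   \
    if (demo_e_) {                                                    \
      fprintf(stderr, "error %s: %s\n", (what), demo_e_->message);    \
      status = 1;                                                     \
      goto cleanup;                                                   \
    }                                                                 \
  } while (0)

/* Runs three simulated steps; step number fail_at (1..3) fails, and 0
   means nothing fails. Stores the number of completed steps in
   *steps_done and returns 0 on success, 1 on failure. */
int demo_index_one(int fail_at, int *steps_done)
{
  int status = 0;

  assert(steps_done != NULL);
  *steps_done = 0;

  DEMO_TRY(demo_step(fail_at != 1), "creating document");
  ++*steps_done;
  DEMO_TRY(demo_step(fail_at != 2), "creating field");
  ++*steps_done;
  DEMO_TRY(demo_step(fail_at != 3), "adding document to index");
  ++*steps_done;

cleanup:
  /* In a program like main.c, pool destruction and the thread detach
     would go here so they run on both the success and failure paths. */
  return status;
}
```

The goto-based cleanup keeps a single exit path, so later steps never run
against objects that an earlier, failed step did not create.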
Re: A change of direction
btw - there are some folks (including me until I was distracted for a
while) working on using GCJ to make nice Ruby bindings, and there are
a couple of builds for it floating around (PyLucene's and one from
someone in Java Lucene (Doug?)).

-Brian


On May 15, 2005, at 8:44 PM, Garrett Rooney wrote:

> So I haven't been putting a whole lot of work into Lucene4c lately,
> mainly as a result of picking up and moving across the country for
> a new job. That's pretty much taken care of at this point, and I'm
> finally settled enough that I can pick up this project again.
>
> Well, actually, I've been at least sort of settled for the past few
> weeks, but every time I sat down and tried to get started with the
> part of Lucene4c I had been hacking on before I moved I just never
> got anywhere. So I put it down and stopped worrying about it for a
> while.
>
> Then, in the past few days I got motivated again, although it was
> in a slightly different way than I expected. I noticed that the
> Linux distribution I had on my laptop actually happened to have a
> recent snapshot of GCJ on it, and since it'd been a while since I
> really played with GCJ I decided to give it a shot. Lucene4c was
> the obvious place to start playing.
>
> So far, I've got a reasonable start on a version of Lucene4c that
> sits on top of a GCJ compiled Lucene, using its CNI interfaces to
> call from a thin layer of C++ code into Java. Right now I've got
> it creating documents and fields, and later tonight I expect to be
> indexing documents. Hopefully searching will come reasonably soon
> after that.
>
> I think this is the shortest path towards actually having something
> that people can make use of, which will hopefully lead to this
> becoming a project with more people than just me hacking on it ;-)
>
> Does this mean I'm getting rid of the C implementation? Well, yes
> and no. I'm not planning on hacking on it in the near future, but
> there is a large chunk of useful code there, and I'm not just going
> to delete it or something. When writing the interface to the GCJ
> backed code I'm going to make a concerted effort to make it
> possible to plug a C implementation in as the backend without too
> much trouble, so perhaps in the future someone will be able to
> revisit the code and make it useful again.
>
> Anyway, if anyone wants to take a look at what I've got, just speak
> up and I'll commit it on a branch in the repository. I'm planning
> on doing that eventually anyway (probably once it hits "minimum
> level of functionality", which at this point means "basic indexing
> and searching is working"), but if other people want to see what
> I've got I'm happy to speed up the process.
>
> -garrett
>
>
Re: A change of direction
Brian McCallister wrote:
> btw - there are some folks (including me until I was distracted for a
> while) working on using GCJ to make nice Ruby bindings, and there are a
> couple of builds for it floating around (PyLucene's and one from someone
> in Java Lucene (Doug?)).

Yeah, I've been using the PyLucene Makefile as a reference when I need
to figure out how to make GCJ do what I want. PyLucene.i also revealed
a little of the magic required in various places (like: "oh, that's how
you force it to preload that class and make the static members initialize").

-garrett