Mailing List Archive: Current Status + Bite Sized Tasks

Well, now that we have a mailing list I suppose I should try to sum up
the current state of the project...

I've begun work on actual searching, which currently means I've started
implementing queries and scorers. So far I've got term and boolean
queries, with corresponding scorers.

The scorers are kind of half-assed at this point, since they don't
actually calculate any scores at all; it's all a boolean "it's there or
it's not" kind of thing.

The boolean scorer also doesn't have support for 'not' queries, so while
you can do "foo AND bar" or "foo OR bar", you can't do "foo AND !bar"
quite yet. This will hopefully be cleared up in the next few days.
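
As a rough illustration of what 'not' support amounts to, here is a
minimal sketch, assuming each sub-query can hand back a sorted array of
matching doc ids. The name and_not and the array representation are
made up for this example; they are not the project's scorer API.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the project's API: copy into out
 * every doc id from must that does not appear in must_not.  Both inputs
 * are assumed to be sorted ascending with no duplicates, and out needs
 * room for must_len entries.  Returns the number of ids written. */
size_t
and_not(const int *must, size_t must_len,
        const int *must_not, size_t must_not_len,
        int *out)
{
  size_t i, j = 0, n = 0;

  for (i = 0; i < must_len; ++i)
    {
      /* Skip excluded ids smaller than the current candidate. */
      while (j < must_not_len && must_not[j] < must[i])
        ++j;

      /* Keep the candidate unless the excluded list contains it. */
      if (j == must_not_len || must_not[j] != must[i])
        out[n++] = must[i];
    }

  return n;
}
```

With "foo" matching docs 1, 3, 5, 8 and "bar" matching 2, 3, 8, 9, the
call for "foo AND !bar" would leave docs 1 and 5. A real scorer would
more likely walk the two streams lazily than materialize arrays.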

I've also started polishing up the support for non-optimized indices,
and yesterday I got our first queries that return results from multiple
segments to work correctly. I still need to add an unoptimized test
index and some tests to the test suite; again, this will hopefully
happen real soon now (tm).

There are a number of places people could dive in now if they're looking
for things to do, some easier than others. Here are a few, just off the
top of my head.

We don't currently handle deleted documents. This should be pretty easy
to add; I just haven't gotten around to it. It's just a matter of
parsing the deleted file and checking the set of deleted docs before
returning a hit.
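
The "check before returning a hit" half of that could look something
like the sketch below. It assumes the deleted file has already been
parsed into an in-memory bitset with one bit per doc id; the parsing
itself is not shown, and deleted_set_t, is_deleted and filter_deleted
are hypothetical names invented for the example.

```c
#include <stddef.h>

/* Hypothetical in-memory form of the deleted-docs set: one bit per doc
 * id.  How the bits get filled in from the deleted file is not shown. */
typedef struct {
  const unsigned char *bits;
  size_t num_docs;
} deleted_set_t;

/* Nonzero if doc is marked deleted.  Ids outside the set count as live. */
int
is_deleted(const deleted_set_t *del, size_t doc)
{
  if (doc >= del->num_docs)
    return 0;
  return (del->bits[doc >> 3] >> (doc & 7)) & 1;
}

/* Compact hits in place, dropping deleted docs; returns the new count. */
size_t
filter_deleted(const deleted_set_t *del, size_t *hits, size_t num_hits)
{
  size_t i, kept = 0;

  for (i = 0; i < num_hits; ++i)
    if (!is_deleted(del, hits[i]))
      hits[kept++] = hits[i];

  return kept;
}
```

In practice the check would probably sit inside the scorers' iteration
rather than run as a separate pass over a finished hit list.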

There's currently no query parser. I've played with the lemon parser
generator a bit, and that's what I'd like to use, but really I'd love to
see any kind of parser contributed, since it'd be nice if people didn't
have to manually assemble queries themselves.

The scorers don't actually compute scores. Fixing this involves
figuring out how Lucene is actually computing scores and implementing
the code to read various related bits from the index, which I haven't
gotten around to yet.

We need a higher level interface to run searches, analogous to an
IndexSearcher in Java Lucene. This needs to take scores into account,
ordering the results returned, so it really depends on the previous task.
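
The ordering part on its own is small. A minimal sketch, assuming the
scorers eventually produce a float score per doc (hit_t and sort_hits
are hypothetical names, and the tie-break on doc id is just one
reasonable choice):

```c
#include <stdlib.h>

/* Hypothetical hit record; a real searcher may carry more than this. */
typedef struct {
  int doc;
  float score;
} hit_t;

/* qsort comparator: higher scores first, ties broken by lower doc id. */
static int
compare_hits(const void *a, const void *b)
{
  const hit_t *x = a;
  const hit_t *y = b;

  if (x->score != y->score)
    return (x->score < y->score) ? 1 : -1;

  return (x->doc > y->doc) - (x->doc < y->doc);
}

/* Sort hits in place, best score first. */
void
sort_hits(hit_t *hits, size_t num_hits)
{
  qsort(hits, num_hits, sizeof(*hits), compare_hits);
}
```

A searcher that only wants the top few hits could keep a bounded
priority queue instead of sorting everything.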

There are various queries and scorers that still need to be implemented.

The various APIs that iterate over documents need to be evaluated.
Currently we signal the end of iteration with an lcn_error_t, which is
probably rather heavyweight, so we probably want to return booleans
instead while still maintaining an API that looks reasonable when called
in a loop.
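
One possible shape for a boolean-returning iterator is sketched below.
It walks an in-memory array purely for illustration; doc_iter_t,
doc_iter_next and count_docs are hypothetical names, not the current
API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator over an in-memory list of doc ids. */
typedef struct {
  const int *docs;
  size_t len;
  size_t pos;
} doc_iter_t;

/* Advance the iterator.  Returns true and stores the next id in *doc,
 * or returns false once the list is exhausted, leaving *doc untouched. */
bool
doc_iter_next(doc_iter_t *it, int *doc)
{
  if (it->pos >= it->len)
    return false;

  *doc = it->docs[it->pos++];
  return true;
}

/* Example caller: the whole iteration fits in a plain while loop. */
size_t
count_docs(doc_iter_t *it)
{
  int doc;
  size_t n = 0;

  while (doc_iter_next(it, &doc))
    ++n;

  return n;
}
```

What this sketch leaves open is how a real I/O failure would be
reported, since a bare boolean can't tell "finished" from "failed".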

The error code currently makes use of APR status codes; we really need
our own set of return values, like those in Subversion. Once we have
them, our various return values need to be evaluated to see if a
Lucene-specific error is more appropriate, and any spot that depends on
the current value needs to be corrected.
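
A minimal sketch of what a project-specific set of codes might look
like follows. APR reserves a range for application-defined errors
starting at APR_OS_START_USERERR, and the idea here is to number our
codes from there. Every LCN_ERR_* name and lcn_* function below is
invented for the example, and the literal base is a stand-in so the
sketch compiles without APR.

```c
/* Placeholder base for project-specific codes.  In a real build this
 * would presumably be APR_OS_START_USERERR from apr_errno.h; a literal
 * assumed to match it is used here so the sketch needs no APR headers. */
#define LCN_ERR_BASE 120000

/* Hypothetical error codes, invented for this example. */
enum {
  LCN_ERR_END_OF_DOCS = LCN_ERR_BASE + 1,
  LCN_ERR_BAD_INDEX,
  LCN_ERR_UNSUPPORTED_QUERY,
  LCN_ERR_LAST  /* one past the last defined code */
};

/* Nonzero if status falls in the project-specific range. */
int
lcn_is_lucene_error(int status)
{
  return status > LCN_ERR_BASE && status < LCN_ERR_LAST;
}

/* Map a project-specific code to a short description. */
const char *
lcn_strerror(int status)
{
  switch (status)
    {
    case LCN_ERR_END_OF_DOCS:       return "no more documents";
    case LCN_ERR_BAD_INDEX:         return "malformed index";
    case LCN_ERR_UNSUPPORTED_QUERY: return "unsupported query";
    default:                        return "unknown error";
    }
}
```

Callers that currently compare against a particular APR status would
then switch to the matching LCN_ERR_* value where one applies.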

All of these will eventually go into JIRA once it's set up, but for now
they'll just have to live on the mailing list. If anyone has any
comments or questions on how to get started on a task, feel free to ask.

-garrett