Hi there,
I was talking with Varun at Berlin Buzzwords a couple of weeks ago about
storing and retrieving Lucene indexes in S3, and realized that "uploading a
Lucene directory to the cloud and downloading it on other machines" is a
pretty common problem and one that's surprisingly easy to do poorly. At my
current job, I'm now on the third team that has needed to do this.
In my experience, there are three main pieces that need to be implemented:
1. Uploading/downloading individual files (i.e. the blob store), which can
be eventually consistent as long as each file is written once and never
overwritten.
2. Describing the metadata for a specific commit point (basically what the
Replicator module does with the "Revision" class). In particular, a
downloader needs to reliably determine which files it already has (and so
doesn't need to download again).
3. Sharing metadata with some degree of consistency, so that multiple
writers don't clobber each other's metadata, and so readers can discover
the metadata for the latest commit/revision and trust that they'll
(eventually) be able to download the relevant files.
I'd like to share what I've got for 1 and 3, which is based on S3 and
DynamoDB, but behind interfaces that lend themselves to other
implementations for blob and metadata storage.
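
As a rough illustration of how the three pieces could fit together, here is
a minimal Java sketch. Every name in it (BlobStore, CommitMetadata,
MetadataStore, and the in-memory implementations) is hypothetical and not an
existing Lucene API. The in-memory classes only stand in for S3 and DynamoDB,
and the "expected generation" check is just one assumed way to keep writers
from clobbering each other.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

public final class IndexReplicationSketch {

    /** Piece 1 (hypothetical): write-once blob storage; S3 could back this. */
    public interface BlobStore {
        void put(String key, byte[] contents); // a key is never overwritten
        byte[] get(String key);
    }

    /** Piece 2 (hypothetical): one commit point as generation + file name -> checksum. */
    public static final class CommitMetadata {
        public final long generation;
        public final Map<String, String> fileChecksums;

        public CommitMetadata(long generation, Map<String, String> fileChecksums) {
            this.generation = generation;
            this.fileChecksums = Map.copyOf(fileChecksums);
        }

        /** Names of files whose checksum is absent or different in the local copy. */
        public Set<String> missingFrom(Map<String, String> localChecksums) {
            Set<String> missing = new TreeSet<>();
            for (Map.Entry<String, String> e : fileChecksums.entrySet()) {
                if (!e.getValue().equals(localChecksums.get(e.getKey()))) {
                    missing.add(e.getKey());
                }
            }
            return missing;
        }
    }

    /** Piece 3 (hypothetical): consistent sharing of the latest commit metadata. */
    public interface MetadataStore {
        Optional<CommitMetadata> latest();

        /** Succeeds only if the stored generation equals expectedGeneration (-1 if none). */
        boolean publish(long expectedGeneration, CommitMetadata next);
    }

    /** Stand-in for S3: rejects a second write to the same key. */
    public static final class InMemoryBlobStore implements BlobStore {
        private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

        public void put(String key, byte[] contents) {
            if (blobs.putIfAbsent(key, contents.clone()) != null) {
                throw new IllegalStateException("blob already exists: " + key);
            }
        }

        public byte[] get(String key) {
            byte[] contents = blobs.get(key);
            if (contents == null) {
                throw new NoSuchElementException("no such blob: " + key);
            }
            return contents.clone();
        }
    }

    /** Stand-in for DynamoDB: a compare-and-set on the generation number. */
    public static final class InMemoryMetadataStore implements MetadataStore {
        private CommitMetadata latest; // null until the first successful publish

        public synchronized Optional<CommitMetadata> latest() {
            return Optional.ofNullable(latest);
        }

        public synchronized boolean publish(long expectedGeneration, CommitMetadata next) {
            long current = (latest == null) ? -1L : latest.generation;
            if (current != expectedGeneration) {
                return false; // another writer published first
            }
            latest = next;
            return true;
        }
    }

    public static void main(String[] args) {
        BlobStore blobs = new InMemoryBlobStore();
        MetadataStore metadata = new InMemoryMetadataStore();

        // Writer: upload the files first, then publish the metadata that names them.
        blobs.put("_0.cfs", new byte[] {1, 2, 3});
        blobs.put("segments_1", new byte[] {4});
        CommitMetadata commit =
                new CommitMetadata(1L, Map.of("_0.cfs", "sum-a", "segments_1", "sum-b"));
        System.out.println("published: " + metadata.publish(-1L, commit));

        // Reader: discover the latest commit and fetch only what is missing locally.
        Map<String, String> local = Map.of("_0.cfs", "sum-a");
        CommitMetadata latest = metadata.latest().orElseThrow();
        for (String name : latest.missingFrom(local)) {
            System.out.println("downloading " + name + " (" + blobs.get(name).length + " bytes)");
        }
    }
}
```

Uploading blobs before publishing metadata is what lets a reader trust that
any commit it discovers refers to files that are (eventually) downloadable.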
Is it worth opening a Jira issue for this? Is this something that would
benefit the Lucene community?
Thanks,
Michael Froh