Mailing List Archive

Lucene Index Cloud Replication
Hi there,

I was talking with Varun at Berlin Buzzwords a couple of weeks ago about
storing and retrieving Lucene indexes in S3, and realized that "uploading a
Lucene directory to the cloud and downloading it on other machines" is a
pretty common problem and one that's surprisingly easy to do poorly. In my
current job, I'm on my third team that needed to do this.

In my experience, there are three main pieces that need to be implemented
(see the rough interface sketch after this list):

1. Uploading/downloading individual files (i.e. the blob store), which can
be eventually consistent if you write once.
2. Describing the metadata for a specific commit point (basically what the
Replicator module does with the "Revision" class). In particular, we want a
downloader to be able to reliably determine whether it already has specific
files (so it doesn't download them again).
3. Sharing metadata with some degree of consistency, so that multiple
writers don't clobber each other's metadata, and so readers can discover
the metadata for the latest commit/revision and trust that they'll
(eventually) be able to download the relevant files.
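
To make that concrete, here is a rough sketch of the shape those three
pieces might take. The names and signatures below are illustrative only --
not an existing Lucene API and not necessarily what my code looks like:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;
    import java.util.Optional;

    /** Piece 1: write-once blob store; eventual consistency is fine for immutable blobs. */
    interface BlobStore {
      void upload(String name, InputStream data, long length) throws IOException;
      InputStream download(String name) throws IOException;
    }

    /** Piece 2: everything a downloader needs to decide which files it is missing. */
    record FileRef(String blobName, String localName, long length, long checksum) {}
    record CommitMetadata(long generation, List<FileRef> files) {}

    /** Piece 3: consistent publication/discovery of the latest commit metadata. */
    interface CommitMetadataStore {
      /** Publish atomically; return false if a newer generation is already published. */
      boolean publish(CommitMetadata commit) throws IOException;
      Optional<CommitMetadata> latest() throws IOException;
    }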

I'd like to share what I've got for 1 and 3, based on S3 and DynamoDB, but
with interfaces that lend themselves to other implementations of blob and
metadata storage.
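
For piece 3, one way a DynamoDB-backed metadata store can keep concurrent
writers from clobbering each other is a conditional put keyed on the commit
generation. This is only a hedged sketch of that pattern -- the table name,
key schema, and attribute names are made up for illustration, not taken from
the implementation described above:

    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
    import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

    public class PublishCommit {
      /** Returns true if this commit became the latest, false if a newer one already won. */
      static boolean publish(DynamoDbClient ddb, String indexName, long generation, String filesJson) {
        try {
          ddb.putItem(PutItemRequest.builder()
              .tableName("lucene-commits")  // assumed table, partition key "indexName"
              .item(Map.of(
                  "indexName", AttributeValue.builder().s(indexName).build(),
                  "generation", AttributeValue.builder().n(Long.toString(generation)).build(),
                  "files", AttributeValue.builder().s(filesJson).build()))
              // Only write if no commit exists yet, or the stored one is older.
              .conditionExpression("attribute_not_exists(generation) OR generation < :g")
              .expressionAttributeValues(Map.of(
                  ":g", AttributeValue.builder().n(Long.toString(generation)).build()))
              .build());
          return true;
        } catch (ConditionalCheckFailedException e) {
          return false;
        }
      }
    }

Readers then only need a plain GetItem on the index name; as long as blobs
are uploaded before the metadata is published, a reader that sees a commit
can trust the referenced files will (eventually) be downloadable.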

Is it worth opening a Jira issue for this? Is this something that would
benefit the Lucene community?

Thanks,
Michael Froh
Re: Lucene Index Cloud Replication
+1 to share code for doing 1) and 3), both of which are tricky!

Safely moving / copying bytes around is a notoriously difficult problem ...
but Lucene's "end to end checksums" and per-segment-file-GUID make this
safer.
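
For example, a downloader can re-verify a fetched file against Lucene's
footer checksum before trusting it locally. A minimal sketch (the directory
path and file name come from command-line arguments, purely for
illustration):

    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.codecs.CodecUtil;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    public class VerifyDownloadedFile {
      public static void main(String[] args) throws IOException {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]));
             IndexInput in = dir.openInput(args[1], IOContext.READONCE)) {
          // Reads the entire file and throws CorruptIndexException if the trailing
          // checksum does not match the bytes that were actually written.
          long checksum = CodecUtil.checksumEntireFile(in);
          System.out.println(args[1] + ": checksum OK (" + checksum + ")");
        }
      }
    }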

I think Lucene's replicator module is a good place for this?
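
For reference, the publish side of the existing replicator module looks
roughly like this today (local replication only; a cloud version would
presumably add Revision/Replicator implementations backed by blob and
metadata stores). Treat this as a sketch -- the index path is made up:

    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.index.SnapshotDeletionPolicy;
    import org.apache.lucene.replicator.IndexRevision;
    import org.apache.lucene.replicator.LocalReplicator;
    import org.apache.lucene.replicator.Replicator;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class PublishRevision {
      public static void main(String[] args) throws IOException {
        Directory dir = FSDirectory.open(Paths.get("/tmp/index"));  // assumed path
        IndexWriterConfig cfg = new IndexWriterConfig();
        // IndexRevision requires a SnapshotDeletionPolicy so published files are not deleted
        cfg.setIndexDeletionPolicy(new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy()));
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
          writer.commit();
          Replicator replicator = new LocalReplicator();
          replicator.publish(new IndexRevision(writer));
        }
      }
    }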

Mike McCandless

http://blog.mikemccandless.com


Re: Lucene Index Cloud Replication [ In reply to ]
Another +1. We are also big S3 + Lucene users, and it is very interesting to
see what other people have come up with. We have an S3 Lucene directory that
allows immediate read-only use of Lucene indexes stored on S3 with
simultaneous local caching, plus a prototype of segment-based index
replication built on a custom deletion policy. Michael McCandless said it
very well that neither Solr nor Elasticsearch supports segment-based index
distribution, and for large-scale indexing this is a very nice way of
distributing Lucene indexes.
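
Not the actual code described above, but one hedged sketch of what such a
read-through S3 directory might look like: a FilterDirectory over a local
FSDirectory cache that pulls missing files from S3 before opening them
(bucket, key layout, and AWS SDK wiring are assumptions):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.FilterDirectory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    /** Read-only view of an index stored in S3, caching files locally on first use. */
    class S3CachingDirectory extends FilterDirectory {
      private final Path cacheDir;
      private final S3Client s3;
      private final String bucket;
      private final String prefix;

      S3CachingDirectory(Path cacheDir, S3Client s3, String bucket, String prefix) throws IOException {
        super(FSDirectory.open(cacheDir));
        this.cacheDir = cacheDir;
        this.s3 = s3;
        this.bucket = bucket;
        this.prefix = prefix;
      }

      @Override
      public IndexInput openInput(String name, IOContext context) throws IOException {
        Path local = cacheDir.resolve(name);
        if (!Files.exists(local)) {
          // Cache miss: segment files are immutable, so download once and reuse.
          s3.getObject(GetObjectRequest.builder().bucket(bucket).key(prefix + name).build(), local);
        }
        return super.openInput(name, context);
      }
    }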



--

Anton Zenkov | Director Of Engineering

azenkov@brandwatch.com

