Mailing List Archive: Replicator PrimaryNode waits forever for remotes to close

Jun 29, 2022, 4:35 PM

Post #1 of 3 (258 views)

Hi Lucene fans,

We use lucene-replicator to copy our indexes from a primary to replica nodes.
Usually, startup and shutdown are fine. In particular we call PrimaryNode.close.

But, in some edge cases - dropped connection? IOException? some process crashed? -
we sometimes hang in PrimaryNode.waitForAllRemotesToClose, which never returns.

I suspect we have a reference counting bug: in some exceptional case, we forget to release our CopyState.
This definitely should be fixed, but in the meantime, it's very unhelpful for the primary node to never come down.

I was considering submitting a PR to add a configurable timeout for the shutdown wait - and after the timeout expires,
continue with closing even though some replicas did not terminate.
They will possibly crash with an "IOException: directory closed" later, or maybe never come back at all.
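
To make the idea concrete, here is a minimal sketch of a bounded shutdown wait. It is not the actual PrimaryNode code: the class BoundedCloseWait, its acquire/release methods, and the boolean return value are all hypothetical, and it only assumes that the primary tracks outstanding copies with a simple counter.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the primary's shutdown wait; not Lucene's PrimaryNode. */
class BoundedCloseWait {
  // Assumed counter of CopyState-like references that replicas still hold.
  private final AtomicInteger copyingCount = new AtomicInteger();

  void acquire() {
    copyingCount.incrementAndGet();
  }

  synchronized void release() {
    copyingCount.decrementAndGet();
    notifyAll(); // wake a thread blocked in waitForAllRemotesToClose
  }

  /**
   * Waits until every reference is released or the timeout expires.
   * Returns true if all references were released, false if we gave up
   * (timeout or interrupt) and the caller should close anyway.
   */
  synchronized boolean waitForAllRemotesToClose(long timeout, TimeUnit unit) {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (copyingCount.get() > 0) {
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false; // timed out with references still outstanding
      }
      // wait(0) would block forever, so round up to at least 1 ms.
      long millis = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
      try {
        wait(millis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

A caller would log the false case and carry on closing the directory, accepting that a lagging replica may then fail on its next read.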

Does this sound like a welcome change? Is there a better way to avoid hanging here, other than to be bug-free?
It's quite challenging to figure out where the CopyState wasn't released, as only a count is kept.
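
As an illustration of what would make the leak easier to find, here is a rough sketch of a tracker that remembers where each reference was acquired instead of keeping only a count. Everything in it (the TrackedRefs class, the token-based acquire/release, the "who" label) is hypothetical and not part of lucene-replicator.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical debugging aid; not part of lucene-replicator. */
class TrackedRefs {
  private final AtomicLong nextId = new AtomicLong();
  // token -> Throwable whose stack trace marks the acquire site
  private final Map<Long, Throwable> holders = new ConcurrentHashMap<>();

  /** Records where the reference was taken and returns a token for release. */
  long acquire(String who) {
    long id = nextId.incrementAndGet();
    holders.put(id, new Throwable("acquired by " + who));
    return id;
  }

  /** Returns false if the token is unknown or was already released. */
  boolean release(long id) {
    return holders.remove(id) != null;
  }

  /** Number of references that have been acquired but not yet released. */
  int outstanding() {
    return holders.size();
  }

  /** Prints the acquire site of every reference that is still held. */
  void dumpOutstanding() {
    holders.forEach((id, site) -> {
      System.err.println("unreleased reference " + id);
      site.printStackTrace();
    });
  }
}
```

Calling dumpOutstanding() when the shutdown wait stalls would point at the code path that forgot to release, at the cost of capturing a stack trace on every acquire.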

Thanks!

Steven Schlansker

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Replicator PrimaryNode waits forever for remotes to close [ In reply to ]

lucene at mikemccandless

Jun 30, 2022, 10:40 AM

Post #2 of 3 (258 views)

+1 to provide a timeout, or, to simply fix close to aggressively close
regardless of what the replicas are doing?

It's not a great design for primary to be so dependent on the replicas (but
vice versa makes sense?).

Maybe open a Jira issue or start a PR so we can discuss?

Thanks for uncovering this and proposing a fix!

Mike McCandless

http://blog.mikemccandless.com


Re: Replicator PrimaryNode waits forever for remotes to close [ In reply to ]

stevenschlansker at gmail

Jul 1, 2022, 1:11 PM

Post #3 of 3 (258 views)

> On Jun 30, 2022, at 10:40 AM, Michael McCandless <lucene@mikemccandless.com> wrote:
>
> +1 to provide a timeout, or, to simply fix close to aggressively close regardless of what the replicas are doing?

Yes, aggressively closing would be great for us - we already expect the primary can and will crash, so an aggressive close is no worse than that.
I proposed the timeout on the theory that There Must Be A Reason It Is This Way :) but if the simpler solution is acceptable that's great for us!

> It's not a great design for primary to be so dependent on the replicas (but vice versa makes sense?).

In our case, we use stateless HTTP to do the replication instead of the stateful sockets the reference implementation uses.
This makes the reference counting for CopyState a little messy but has other benefits that for us outweigh the costs.
So for us, I think this might be the only place the primary depends on the replicas at all, and it'd be wonderful to break that dependency.

> Maybe open a Jira issue or start a PR so we can discuss?

I filed https://issues.apache.org/jira/browse/LUCENE-10638 for further discussion. Thanks!

