heartbeat gmain source priority inversion with rexmit and dead node detection

Apr 27, 2012, 7:11 AM

Post #1 of 3

On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
> Hi All,
>
> We ran a test that simulated a remote cluster environment,
> and we tested packet loss.

You may be interested in this patch, which I have had lying around for ages.

It may be incomplete for one corner case:
On a seriously misconfigured and overloaded system,
I have seen reports of a single send_local_status()
(which is basically one single send_cluster_msg())
taking longer to execute than deadtime
(without even returning to the mainloop!).

This corner case should be handled with a watchdog.
But without a watchdog, and without stonith,
the CCM was confused, because one node saw a
leave and then a re-join after a partition event, while the other node did not
even notice it had left and rejoined the membership...
and Pacemaker ended up being DC on both :-/

So I guess send_local_status() could do with an explicit call to
check_for_timeouts(), but that may need recursion protection.
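
A minimal sketch of what such recursion protection might look like. Only the names send_local_status() and check_for_timeouts() come from this thread; the function bodies, the counters, and the guard flag below are hypothetical stand-ins, and the sketch assumes that the timeout check can itself end up sending status again. It also assumes a single-threaded main loop, so a plain static flag is enough.

```c
#include <stdbool.h>

/* Counters standing in for the real work; hypothetical, illustration only. */
static int timeout_checks_run = 0;
static int status_msgs_sent = 0;

/* Guard flag: true while check_for_timeouts_guarded() is on the call stack. */
static bool in_timeout_check = false;

void send_local_status_sketch(void);

/* Stand-in for check_for_timeouts(). */
void check_for_timeouts_guarded(void)
{
	if (in_timeout_check) {
		return;	/* already running further up the stack */
	}
	in_timeout_check = true;
	timeout_checks_run++;
	/* Assumption: the timeout check may send status as a side effect,
	 * which is where the recursion would start. */
	send_local_status_sketch();
	in_timeout_check = false;
}

/* Stand-in for send_local_status(). */
void send_local_status_sketch(void)
{
	status_msgs_sent++;	/* pretend this is the slow send_cluster_msg() */
	/* After a potentially long send, check deadtime before returning. */
	check_for_timeouts_guarded();
}
```

With the guard in place, one outer call runs the timeout check once, and the nested call returns immediately instead of recursing.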

I should really polish and push my queue some day soon...

Cheers,

diff --git a/heartbeat/hb_rexmit.c b/heartbeat/hb_rexmit.c
--- a/heartbeat/hb_rexmit.c
+++ b/heartbeat/hb_rexmit.c
@@ -168,6 +168,7 @@ send_rexmit_request( gpointer data)
 	if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
 	    STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
 		/* no point requesting rexmit from a dead node. */
+		g_hash_table_remove(rexmit_hash_table, ri);
 		return FALSE;
 	}

@@ -243,7 +244,7 @@ schedule_rexmit_request(struct node_info
 	ri->seq = seq;
 	ri->node = node;

-	sourceid = Gmain_timeout_add_full(G_PRIORITY_HIGH - 1, delay,
+	sourceid = Gmain_timeout_add_full(PRI_REXMIT, delay,
 		send_rexmit_request, ri, NULL);
 	G_main_setall_id(sourceid, "retransmit request", config->heartbeat_ms/2, 10);

diff --git a/heartbeat/heartbeat.c b/heartbeat/heartbeat.c
--- a/heartbeat/heartbeat.c
+++ b/heartbeat/heartbeat.c
@@ -1585,7 +1585,7 @@ master_control_process(void)

 	send_local_status();

-	if (G_main_add_input(G_PRIORITY_HIGH, FALSE,
+	if (G_main_add_input(PRI_POLL, FALSE,
 		&polled_input_SourceFuncs) ==NULL){
 		cl_log(LOG_ERR, "master_control_process: G_main_add_input failed");
 	}
diff --git a/include/hb_api_core.h b/include/hb_api_core.h
--- a/include/hb_api_core.h
+++ b/include/hb_api_core.h
@@ -40,6 +40,12 @@
 #define PRI_READPKT (PRI_SENDPKT+1)
 #define PRI_FIFOMSG (PRI_READPKT+1)

+/* PRI_POLL is where the timeout checks on deadtime happen.
+ * Better be sure rexmit requests for lost packets
+ * from a now dead node do not preempt detecting it as being dead. */
+#define PRI_POLL (G_PRIORITY_HIGH)
+#define PRI_REXMIT PRI_POLL
+
 #define PRI_CHECKSIGS (G_PRIORITY_DEFAULT)
 #define PRI_FREEMSG (PRI_CHECKSIGS+1)
 #define PRI_CLIENTMSG (PRI_FREEMSG+1)
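
To illustrate why the priority change matters, here is a simplified model of a GLib-style main loop iteration. It is a sketch, not heartbeat or GLib code. In GLib a numerically smaller priority is more urgent, and in one iteration only the ready sources at the most urgent ready priority get dispatched. The struct, the function names, and the assumption that both sources are permanently ready (as could happen with a steady stream of rexmit timers during packet loss) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* GLib defines G_PRIORITY_HIGH as -100; lower number = more urgent. */
#define SIM_PRIORITY_HIGH (-100)

struct sim_source {
	int priority;	/* GLib-style: numerically smaller runs first */
	int ready;	/* nonzero if the source wants to be dispatched */
	int dispatched;	/* how often it actually ran */
};

/* One main loop iteration: find the most urgent priority among ready
 * sources, then dispatch only the ready sources at exactly that priority. */
void sim_iterate(struct sim_source *src, size_t n)
{
	int have_ready = 0;
	int best = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (src[i].ready && (!have_ready || src[i].priority < best)) {
			best = src[i].priority;
			have_ready = 1;
		}
	}
	if (!have_ready)
		return;
	for (i = 0; i < n; i++) {
		if (src[i].ready && src[i].priority == best)
			src[i].dispatched++;
	}
}

/* Run `iterations` loop passes with a rexmit source and the deadtime poll
 * source both permanently ready; return how often the poll source ran. */
int sim_poll_dispatches(int rexmit_priority, int poll_priority, int iterations)
{
	struct sim_source src[2] = {
		{ rexmit_priority, 1, 0 },	/* retransmit request timer */
		{ poll_priority, 1, 0 },	/* polled input: deadtime checks */
	};
	int i;

	assert(iterations >= 0);
	for (i = 0; i < iterations; i++)
		sim_iterate(src, 2);
	return src[1].dispatched;
}
```

In this model, sim_poll_dispatches(SIM_PRIORITY_HIGH - 1, SIM_PRIORITY_HIGH, 10) returns 0: the deadtime check is starved by the more urgent rexmit source. With both at SIM_PRIORITY_HIGH, as in the patch, it returns 10.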
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Re: heartbeat gmain source priority inversion with rexmit and dead node detection

andrew at beekhof

Apr 29, 2012, 7:23 PM

Post #2 of 3

On Sat, Apr 28, 2012 at 12:11 AM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
>> Hi All,
>>
>> We ran a test that simulated a remote cluster environment,
>> and we tested packet loss.
>
> You may be interested in this patch, which I have had lying around for ages.
>
> It may be incomplete for one corner case:
> On a seriously misconfigured and overloaded system,
> I have seen reports of a single send_local_status()
> (which is basically one single send_cluster_msg())
> taking longer to execute than deadtime
> (without even returning to the mainloop!).
>
> This corner case should be handled with a watchdog.
> But without a watchdog, and without stonith,
> the CCM was confused, because one node saw a
> leave and then a re-join after a partition event, while the other node did not
> even notice it had left and rejoined the membership...
> and Pacemaker ended up being DC on both :-/

A side effect of the ccm being "really confused" I assume?

>
> So I guess send_local_status() could do with an explicit call to
> check_for_timeouts(), but that may need recursion protection.
>
>
> I should really polish and push my queue some day soon...
>
> Cheers,

Re: heartbeat gmain source priority inversion with rexmit and dead node detection

lars.ellenberg at linbit

Apr 30, 2012, 4:32 AM

Post #3 of 3

On Mon, Apr 30, 2012 at 12:23:56PM +1000, Andrew Beekhof wrote:
> On Sat, Apr 28, 2012 at 12:11 AM, Lars Ellenberg
> <lars.ellenberg@linbit.com> wrote:
> > On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
> >> Hi All,
> >>
> >> We ran a test that simulated a remote cluster environment,
> >> and we tested packet loss.
> >
> > You may be interested in this patch, which I have had lying around for ages.
> >
> > It may be incomplete for one corner case:
> > On a seriously misconfigured and overloaded system,
> > I have seen reports of a single send_local_status()
> > (which is basically one single send_cluster_msg())
> > taking longer to execute than deadtime
> > (without even returning to the mainloop!).
> >
> > This corner case should be handled with a watchdog.
> > But without a watchdog, and without stonith,
> > the CCM was confused, because one node saw a
> > leave and then a re-join after a partition event, while the other node did not
> > even notice it had left and rejoined the membership...
> > and Pacemaker ended up being DC on both :-/
>
> A side effect of the ccm being "really confused" I assume?

I guess so, yes.
Not sure if pacemaker could have handled it differently,
based on the input it was fed from ccm.

At some point, Pacemaker complained about "Another DC detected",
but things never really recovered.
If I can dig up the logs again, I'll show you some lines.

But then: no stonith, no watchdog, and a system overloaded to the point
where processing a single mainloop dispatch callback takes longer
than what is supposed to be the deadtime, and that within the
heartbeat communication main processes, which are supposed to be
realtime...

I don't think additional paranoia code would do much good,
on any level.

But thanks for your attention ;-)

> >
> > So I guess send_local_status() could do with an explicit call to
> > check_for_timeouts(), but that may need recursion protection.
> >
> >
> > I should really polish and push my queue some day soon...
> >
> > Cheers,
> >
> >
> > diff --git a/heartbeat/hb_rexmit.c b/heartbeat/hb_rexmit.c
...

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.