Mailing List Archive

[Patch] The patch which revises memory leak.
Hi All,

We gave test that assumed remote cluster environment.
And we tested packet lost.

The retransmission timer of Heartbeat causes memory leak.

I donate a patch.
Please confirm the contents of the patch.
And please reflect a patch in a repository of Heartbeat.

Best Regards,
Hideo Yamauchi.
Re: [Patch] The patch which revises memory leak. [ In reply to ]
On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
> Hi All,
>
> We gave test that assumed remote cluster environment.
> And we tested packet lost.
>
> The retransmission timer of Heartbeat causes memory leak.
>
> I donate a patch.
> Please confirm the contents of the patch.
> And please reflect a patch in a repository of Heartbeat.

Have you actually been able to measure that memory leak you observed,
and you can confirm this patch will fix it?

Because I don't think this patch has any effect.

send_rexmit_request() is only used as paramter to
Gmain_timeout_add_full, and it returns FALSE always,
which should cause the respective sourceid to be auto-removed.


> diff -r 106ca984041b heartbeat/hb_rexmit.c
> --- a/heartbeat/hb_rexmit.c Thu Apr 26 19:28:26 2012 +0900
> +++ b/heartbeat/hb_rexmit.c Thu Apr 26 19:31:44 2012 +0900
> @@ -164,6 +164,8 @@
> seqno_t seq = (seqno_t) ri->seq;
> struct node_info* node = ri->node;
> struct ha_msg* hmsg;
> + unsigned long sourceid;
> + gpointer value;
>
> if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
> STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
> @@ -196,11 +198,17 @@
>
> node->track.last_rexmit_req = time_longclock();
>
> - if (!g_hash_table_remove(rexmit_hash_table, ri)){
> - cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> - "for seq/node(%ld %s)",
> - __FUNCTION__, ri->seq, ri->node->nodename);
> - return FALSE;
> + value = g_hash_table_lookup(rexmit_hash_table, ri);
> + if ( value != NULL) {
> + sourceid = (unsigned long) value;
> + Gmain_timeout_remove(sourceid);
> +
> + if (!g_hash_table_remove(rexmit_hash_table, ri)){
> + cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> + "for seq/node(%ld %s)",
> + __FUNCTION__, ri->seq, ri->node->nodename);
> + return FALSE;
> + }
> }
>
> schedule_rexmit_request(node, seq, max_rexmit_delay);


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
Hi Lars,

Thank you for comments.

> Have you actually been able to measure that memory leak you observed,
> and you can confirm this patch will fix it?
>
> Because I don't think this patch has any effect.

Yes.
I really measured leak.
I can show a result next week.
#Japan is a holiday until Tuesday.

>
> send_rexmit_request() is only used as paramter to
> Gmain_timeout_add_full, and it returns FALSE always,
> which should cause the respective sourceid to be auto-removed.

It seems to be necessary to release gsource somehow or other.
The similar liberation seems to be carried out in lrmd.

Best Regards,
Hideo Yamauchi.


--- On Fri, 2012/4/27, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
> > Hi All,
> >
> > We gave test that assumed remote cluster environment.
> > And we tested packet lost.
> >
> > The retransmission timer of Heartbeat causes memory leak.
> >
> > I donate a patch.
> > Please confirm the contents of the patch.
> > And please reflect a patch in a repository of Heartbeat.
>
> Have you actually been able to measure that memory leak you observed,
> and you can confirm this patch will fix it?
>
> Because I don't think this patch has any effect.
>
> send_rexmit_request() is only used as paramter to
> Gmain_timeout_add_full, and it returns FALSE always,
> which should cause the respective sourceid to be auto-removed.
>
>
> > diff -r 106ca984041b heartbeat/hb_rexmit.c
> > --- a/heartbeat/hb_rexmit.c    Thu Apr 26 19:28:26 2012 +0900
> > +++ b/heartbeat/hb_rexmit.c    Thu Apr 26 19:31:44 2012 +0900
> > @@ -164,6 +164,8 @@
> >      seqno_t seq = (seqno_t) ri->seq;
> >      struct node_info* node = ri->node;
> >      struct ha_msg*    hmsg;
> > +    unsigned long           sourceid;
> > +    gpointer value;
> > 
> >      if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
> >          STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
> > @@ -196,11 +198,17 @@
> >     
> >      node->track.last_rexmit_req = time_longclock();   
> >     
> > -    if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > -        cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > -               "for seq/node(%ld %s)",                
> > -               __FUNCTION__, ri->seq, ri->node->nodename);
> > -        return FALSE;
> > +    value = g_hash_table_lookup(rexmit_hash_table, ri);
> > +    if ( value != NULL) {
> > +        sourceid = (unsigned long) value;
> > +        Gmain_timeout_remove(sourceid);
> > +
> > +        if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > +            cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > +                       "for seq/node(%ld %s)",                
> > +                       __FUNCTION__, ri->seq, ri->node->nodename);
> > +            return FALSE;
> > +        }
> >      }
> >     
> >      schedule_rexmit_request(node, seq, max_rexmit_delay);
>
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
Hi Lars,

We confirmed that this problem occurred with v1 mode of Heartbeat.
* The problem happens with the v2 mode in the same way.

We confirmed a problem in the next procedure.

Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network.

Step 2) Between nodes, the retransmission of the message is carried out repeatedly.

Step 3) Then the memory of the master process increases little by little.


-------- As a result of the ps command of the master process ----------
* node1
(start)
32126 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process
(One hour later)
32126 ? SLs 0:03 0 182 54729 7868 0.0 heartbeat: master control process
(Two hour later)
32126 ? SLs 0:08 0 182 55317 8456 0.0 heartbeat: master control process
(Four hours later)
32126 ? SLs 0:24 0 182 56673 9812 0.0 heartbeat: master control process

* node2
(start)
31928 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process
(One hour later)
31928 ? SLs 0:02 0 182 54481 7620 0.0 heartbeat: master control process
(Two hour later)
31928 ? SLs 0:08 0 182 55353 8492 0.0 heartbeat: master control process
(Four hours later)
31928 ? SLs 0:23 0 182 56689 9828 0.0 heartbeat: master control process


The state of the memory leak seems to vary according to a node with the quantity of the retransmission.

The increase of this memory disappears by applying my patch.

And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak.

Best Regards,
Hideo Yamauchi.


--- On Sat, 2012/4/28, renayama19661014@ybb.ne.jp <renayama19661014@ybb.ne.jp> wrote:

> Hi Lars,
>
> Thank you for comments.
>
> > Have you actually been able to measure that memory leak you observed,
> > and you can confirm this patch will fix it?
> >
> > Because I don't think this patch has any effect.
>
> Yes.
> I really measured leak.
> I can show a result next week.
> #Japan is a holiday until Tuesday.
>
> >
> > send_rexmit_request() is only used as paramter to
> > Gmain_timeout_add_full, and it returns FALSE always,
> > which should cause the respective sourceid to be auto-removed.
>
> It seems to be necessary to release gsource somehow or other.
> The similar liberation seems to be carried out in lrmd.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- On Fri, 2012/4/27, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
>
> > On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
> > > Hi All,
> > >
> > > We gave test that assumed remote cluster environment.
> > > And we tested packet lost.
> > >
> > > The retransmission timer of Heartbeat causes memory leak.
> > >
> > > I donate a patch.
> > > Please confirm the contents of the patch.
> > > And please reflect a patch in a repository of Heartbeat.
> >
> > Have you actually been able to measure that memory leak you observed,
> > and you can confirm this patch will fix it?
> >
> > Because I don't think this patch has any effect.
> >
> > send_rexmit_request() is only used as paramter to
> > Gmain_timeout_add_full, and it returns FALSE always,
> > which should cause the respective sourceid to be auto-removed.
> >
> >
> > > diff -r 106ca984041b heartbeat/hb_rexmit.c
> > > --- a/heartbeat/hb_rexmit.c    Thu Apr 26 19:28:26 2012 +0900
> > > +++ b/heartbeat/hb_rexmit.c    Thu Apr 26 19:31:44 2012 +0900
> > > @@ -164,6 +164,8 @@
> > >      seqno_t seq = (seqno_t) ri->seq;
> > >      struct node_info* node = ri->node;
> > >      struct ha_msg*    hmsg;
> > > +    unsigned long           sourceid;
> > > +    gpointer value;
> > > 
> > >      if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
> > >          STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
> > > @@ -196,11 +198,17 @@
> > >     
> > >      node->track.last_rexmit_req = time_longclock();   
> > >     
> > > -    if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > -        cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > -               "for seq/node(%ld %s)",                
> > > -               __FUNCTION__, ri->seq, ri->node->nodename);
> > > -        return FALSE;
> > > +    value = g_hash_table_lookup(rexmit_hash_table, ri);
> > > +    if ( value != NULL) {
> > > +        sourceid = (unsigned long) value;
> > > +        Gmain_timeout_remove(sourceid);
> > > +
> > > +        if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > +            cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > +                       "for seq/node(%ld %s)",                
> > > +                       __FUNCTION__, ri->seq, ri->node->nodename);
> > > +            return FALSE;
> > > +        }
> > >      }
> > >     
> > >      schedule_rexmit_request(node, seq, max_rexmit_delay);
> >
> >
> > --
> > : Lars Ellenberg
> > : LINBIT | Your Way to High Availability
> > : DRBD/HA support and consulting http://www.linbit.com
> >
> > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > _______________________________________________________
> > Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > Home Page: http://linux-ha.org/
> >
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
Hi Lars,

And when it passes more than a full day....

* node1
32126 ? SLs 79:52 0 182 71189 24328 0.1 heartbeat: master control process

* node2
31928 ? SLs 77:01 0 182 70869 24008 0.1 heartbeat: master control process


Best Regards,
Hideo Yamauchi.


--- On Tue, 2012/5/1, renayama19661014@ybb.ne.jp <renayama19661014@ybb.ne.jp> wrote:

> Hi Lars,
>
> We confirmed that this problem occurred with v1 mode of Heartbeat.
> * The problem happens with the v2 mode in the same way.
>
> We confirmed a problem in the next procedure.
>
> Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network.
>
> Step 2) Between nodes, the retransmission of the message is carried out repeatedly.
>
> Step 3) Then the memory of the master process increases little by little.
>
>
> -------- As a result of the ps command of the master process ----------
> * node1
> (start)
> 32126 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
> (One hour later)
> 32126 ?        SLs    0:03      0   182 54729  7868  0.0 heartbeat: master control process
> (Two hour later)
> 32126 ?        SLs    0:08      0   182 55317  8456  0.0 heartbeat: master control process
> (Four hours later)
> 32126 ?        SLs    0:24      0   182 56673  9812  0.0 heartbeat: master control process
>
> * node2
> (start)
> 31928 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
> (One hour later)
> 31928 ?        SLs    0:02      0   182 54481  7620  0.0 heartbeat: master control process
> (Two hour later)
> 31928 ?        SLs    0:08      0   182 55353  8492  0.0 heartbeat: master control process
> (Four hours later)
> 31928 ?        SLs    0:23      0   182 56689  9828  0.0 heartbeat: master control process
>
>
> The state of the memory leak seems to vary according to a node with the quantity of the retransmission.
>
> The increase of this memory disappears by applying my patch.
>
> And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- On Sat, 2012/4/28, renayama19661014@ybb.ne.jp <renayama19661014@ybb.ne.jp> wrote:
>
> > Hi Lars,
> >
> > Thank you for comments.
> >
> > > Have you actually been able to measure that memory leak you observed,
> > > and you can confirm this patch will fix it?
> > >
> > > Because I don't think this patch has any effect.
> >
> > Yes.
> > I really measured leak.
> > I can show a result next week.
> > #Japan is a holiday until Tuesday.
> >
> > >
> > > send_rexmit_request() is only used as paramter to
> > > Gmain_timeout_add_full, and it returns FALSE always,
> > > which should cause the respective sourceid to be auto-removed.
> >
> > It seems to be necessary to release gsource somehow or other.
> > The similar liberation seems to be carried out in lrmd.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> >
> > --- On Fri, 2012/4/27, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> >
> > > On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
> > > > Hi All,
> > > >
> > > > We gave test that assumed remote cluster environment.
> > > > And we tested packet lost.
> > > >
> > > > The retransmission timer of Heartbeat causes memory leak.
> > > >
> > > > I donate a patch.
> > > > Please confirm the contents of the patch.
> > > > And please reflect a patch in a repository of Heartbeat.
> > >
> > > Have you actually been able to measure that memory leak you observed,
> > > and you can confirm this patch will fix it?
> > >
> > > Because I don't think this patch has any effect.
> > >
> > > send_rexmit_request() is only used as paramter to
> > > Gmain_timeout_add_full, and it returns FALSE always,
> > > which should cause the respective sourceid to be auto-removed.
> > >
> > >
> > > > diff -r 106ca984041b heartbeat/hb_rexmit.c
> > > > --- a/heartbeat/hb_rexmit.c    Thu Apr 26 19:28:26 2012 +0900
> > > > +++ b/heartbeat/hb_rexmit.c    Thu Apr 26 19:31:44 2012 +0900
> > > > @@ -164,6 +164,8 @@
> > > >      seqno_t seq = (seqno_t) ri->seq;
> > > >      struct node_info* node = ri->node;
> > > >      struct ha_msg*    hmsg;
> > > > +    unsigned long           sourceid;
> > > > +    gpointer value;
> > > > 
> > > >      if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
> > > >          STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
> > > > @@ -196,11 +198,17 @@
> > > >     
> > > >      node->track.last_rexmit_req = time_longclock();   
> > > >     
> > > > -    if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > > -        cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > > -               "for seq/node(%ld %s)",                
> > > > -               __FUNCTION__, ri->seq, ri->node->nodename);
> > > > -        return FALSE;
> > > > +    value = g_hash_table_lookup(rexmit_hash_table, ri);
> > > > +    if ( value != NULL) {
> > > > +        sourceid = (unsigned long) value;
> > > > +        Gmain_timeout_remove(sourceid);
> > > > +
> > > > +        if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > > +            cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > > +                       "for seq/node(%ld %s)",                
> > > > +                       __FUNCTION__, ri->seq, ri->node->nodename);
> > > > +            return FALSE;
> > > > +        }
> > > >      }
> > > >     
> > > >      schedule_rexmit_request(node, seq, max_rexmit_delay);
> > >
> > >
> > > --
> > > : Lars Ellenberg
> > > : LINBIT | Your Way to High Availability
> > > : DRBD/HA support and consulting http://www.linbit.com
> > >
> > > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > > _______________________________________________________
> > > Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > > Home Page: http://linux-ha.org/
> > >
> > _______________________________________________________
> > Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > Home Page: http://linux-ha.org/
> >
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
On Wed, May 02, 2012 at 10:43:36AM +0900, renayama19661014@ybb.ne.jp wrote:
> Hi Lars,
>
> And when it passes more than a full day....
>
> * node1
> 32126 ? SLs 79:52 0 182 71189 24328 0.1 heartbeat: master control process
>
> * node2
> 31928 ? SLs 77:01 0 182 70869 24008 0.1 heartbeat: master control process

Oh, I see.

This is a "design choice" (maybe not even intentional) of the Gmain_*
wrappers used throughout the heartbeat code.

The "real" glib g_timeout_add_full(), and most other similar functions,
internally do
id = g_source_attach(source, ...);
g_source_unref(source);
return id;

Thus in g_main_dispatch, the
need_destroy = ! dispatch (...)
if (need_destroy)
g_source_destroy_internal()

in fact ends up destroying it,
if dispatch() returns FALSE,
as documented:
The function is called repeatedly until it returns FALSE, at
which point the timeout is automatically destroyed and the
function will not be called again.

Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full.
It does not g_source_unref(), so we keep the extra reference around
until someone explicitly calls Gmain_timeout_remove().

Talk about principle of least surprise :(

Changing this behaviour to match glib's, i.e. unref'ing after
g_source_attach, would seem like the "correct" thing to do,
but is likely to break other pieces of code in subtle ways,
so it may not be the "right" thing to do at this point.

I'm going to take your patch more or less as is.
If it does not show up soon, prod me again.

Thank you for tracking this down.

> Best Regards,
> Hideo Yamauchi.
>
>
> --- On Tue, 2012/5/1, renayama19661014@ybb.ne.jp <renayama19661014@ybb.ne.jp> wrote:
>
> > Hi Lars,
> >
> > We confirmed that this problem occurred with v1 mode of Heartbeat.
> > * The problem happens with the v2 mode in the same way.
> >
> > We confirmed a problem in the next procedure.
> >
> > Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network.
> >
> > Step 2) Between nodes, the retransmission of the message is carried out repeatedly.
> >
> > Step 3) Then the memory of the master process increases little by little.
> >
> >
> > -------- As a result of the ps command of the master process ----------
> > * node1
> > (start)
> > 32126 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
> > (One hour later)
> > 32126 ?        SLs    0:03      0   182 54729  7868  0.0 heartbeat: master control process
> > (Two hour later)
> > 32126 ?        SLs    0:08      0   182 55317  8456  0.0 heartbeat: master control process
> > (Four hours later)
> > 32126 ?        SLs    0:24      0   182 56673  9812  0.0 heartbeat: master control process
> >
> > * node2
> > (start)
> > 31928 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
> > (One hour later)
> > 31928 ?        SLs    0:02      0   182 54481  7620  0.0 heartbeat: master control process
> > (Two hour later)
> > 31928 ?        SLs    0:08      0   182 55353  8492  0.0 heartbeat: master control process
> > (Four hours later)
> > 31928 ?        SLs    0:23      0   182 56689  9828  0.0 heartbeat: master control process
> >
> >
> > The state of the memory leak seems to vary according to a node with the quantity of the retransmission.
> >
> > The increase of this memory disappears by applying my patch.
> >
> > And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> >
> > --- On Sat, 2012/4/28, renayama19661014@ybb.ne.jp <renayama19661014@ybb.ne.jp> wrote:
> >
> > > Hi Lars,
> > >
> > > Thank you for comments.
> > >
> > > > Have you actually been able to measure that memory leak you observed,
> > > > and you can confirm this patch will fix it?
> > > >
> > > > Because I don't think this patch has any effect.
> > >
> > > Yes.
> > > I really measured leak.
> > > I can show a result next week.
> > > #Japan is a holiday until Tuesday.
> > >
> > > >
> > > > send_rexmit_request() is only used as paramter to
> > > > Gmain_timeout_add_full, and it returns FALSE always,
> > > > which should cause the respective sourceid to be auto-removed.
> > >
> > > It seems to be necessary to release gsource somehow or other.
> > > The similar liberation seems to be carried out in lrmd.
> > >
> > > Best Regards,
> > > Hideo Yamauchi.
> > >
> > >
> > > --- On Fri, 2012/4/27, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> > >
> > > > On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
> > > > > Hi All,
> > > > >
> > > > > We gave test that assumed remote cluster environment.
> > > > > And we tested packet lost.
> > > > >
> > > > > The retransmission timer of Heartbeat causes memory leak.
> > > > >
> > > > > I donate a patch.
> > > > > Please confirm the contents of the patch.
> > > > > And please reflect a patch in a repository of Heartbeat.
> > > >
> > > > Have you actually been able to measure that memory leak you observed,
> > > > and you can confirm this patch will fix it?
> > > >
> > > > Because I don't think this patch has any effect.
> > > >
> > > > send_rexmit_request() is only used as paramter to
> > > > Gmain_timeout_add_full, and it returns FALSE always,
> > > > which should cause the respective sourceid to be auto-removed.
> > > >
> > > >
> > > > > diff -r 106ca984041b heartbeat/hb_rexmit.c
> > > > > --- a/heartbeat/hb_rexmit.c    Thu Apr 26 19:28:26 2012 +0900
> > > > > +++ b/heartbeat/hb_rexmit.c    Thu Apr 26 19:31:44 2012 +0900
> > > > > @@ -164,6 +164,8 @@
> > > > >      seqno_t seq = (seqno_t) ri->seq;
> > > > >      struct node_info* node = ri->node;
> > > > >      struct ha_msg*    hmsg;
> > > > > +    unsigned long           sourceid;
> > > > > +    gpointer value;
> > > > > 
> > > > >      if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
> > > > >          STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
> > > > > @@ -196,11 +198,17 @@
> > > > >     
> > > > >      node->track.last_rexmit_req = time_longclock();   
> > > > >     
> > > > > -    if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > > > -        cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > > > -               "for seq/node(%ld %s)",                
> > > > > -               __FUNCTION__, ri->seq, ri->node->nodename);
> > > > > -        return FALSE;
> > > > > +    value = g_hash_table_lookup(rexmit_hash_table, ri);
> > > > > +    if ( value != NULL) {
> > > > > +        sourceid = (unsigned long) value;
> > > > > +        Gmain_timeout_remove(sourceid);
> > > > > +
> > > > > +        if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > > > +            cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > > > +                       "for seq/node(%ld %s)",                
> > > > > +                       __FUNCTION__, ri->seq, ri->node->nodename);
> > > > > +            return FALSE;
> > > > > +        }
> > > > >      }
> > > > >     
> > > > >      schedule_rexmit_request(node, seq, max_rexmit_delay);

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
Hi Lars,

Thank you for comments.

> >
> > And when it passes more than a full day....
> >
> > * node1
> > 32126 ?        SLs   79:52      0   182 71189 24328  0.1 heartbeat: master control process                       
> >
> > * node2
> > 31928 ?        SLs   77:01      0   182 70869 24008  0.1 heartbeat: master control process
>
> Oh, I see.
>
> This is a "design choice" (maybe not even intentional) of the Gmain_*
> wrappers used throughout the heartbeat code.
>
> The "real" glib g_timeout_add_full(), and most other similar functions,
> internally do
> id = g_source_attach(source, ...);
> g_source_unref(source);
> return id;
>
> Thus in g_main_dispatch, the
> need_destroy = ! dispatch (...)
> if (need_destroy)
>     g_source_destroy_internal()
>
> in fact ends up destroying it,
> if dispatch() returns FALSE,
> as documented:
>     The function is called repeatedly until it returns FALSE, at
>     which point the timeout is automatically destroyed and the
>     function will not be called again.
>
> Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full.
> It does not g_source_unref(), so we keep the extra reference around
> until someone explicitly calls Gmain_timeout_remove().
>
> Talk about principle of least surprise :(
>
> Changing this behaviour to match glib's, i.e. unref'ing after
> g_source_attach, would seem like the "correct" thing to do,
> but is likely to break other pieces of code in subtle ways,
> so it may not be the "right" thing to do at this point.

Thank you for detailed explanation.
If you found the method that is appropriate than the correction that I suggested, I approve of it.

> I'm going to take your patch more or less as is.
> If it does not show up soon, prod me again.
>

All right.

Many Thanks!
Hideo Yamauchi.


> Thank you for tracking this down.
>
> > Best Regards,
> > Hideo Yamauchi.
> >
> >
> > --- On Tue, 2012/5/1, renayama19661014@ybb.ne.jp <renayama19661014@ybb.ne.jp> wrote:
> >
> > > Hi Lars,
> > >
> > > We confirmed that this problem occurred with v1 mode of Heartbeat.
> > >  * The problem happens with the v2 mode in the same way.
> > >
> > > We confirmed a problem in the next procedure.
> > >
> > > Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network.
> > >
> > > Step 2) Between nodes, the retransmission of the message is carried out repeatedly.
> > >
> > > Step 3) Then the memory of the master process increases little by little.
> > >
> > >
> > > -------- As a result of the ps command of the master process ----------
> > > * node1
> > > (start)
> > > 32126 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
> > > (One hour later)
> > > 32126 ?        SLs    0:03      0   182 54729  7868  0.0 heartbeat: master control process
> > > (Two hour later)
> > > 32126 ?        SLs    0:08      0   182 55317  8456  0.0 heartbeat: master control process
> > > (Four hours later)
> > > 32126 ?        SLs    0:24      0   182 56673  9812  0.0 heartbeat: master control process
> > >
> > > * node2
> > > (start)
> > > 31928 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
> > > (One hour later)
> > > 31928 ?        SLs    0:02      0   182 54481  7620  0.0 heartbeat: master control process
> > > (Two hour later)
> > > 31928 ?        SLs    0:08      0   182 55353  8492  0.0 heartbeat: master control process
> > > (Four hours later)
> > > 31928 ?        SLs    0:23      0   182 56689  9828  0.0 heartbeat: master control process
> > >
> > >
> > > The state of the memory leak seems to vary according to a node with the quantity of the retransmission.
> > >
> > > The increase of this memory disappears by applying my patch.
> > >
> > > And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak.
> > >
> > > Best Regards,
> > > Hideo Yamauchi.
> > >
> > >
> > > --- On Sat, 2012/4/28, renayama19661014@ybb.ne.jp <renayama19661014@ybb.ne.jp> wrote:
> > >
> > > > Hi Lars,
> > > >
> > > > Thank you for comments.
> > > >
> > > > > Have you actually been able to measure that memory leak you observed,
> > > > > and you can confirm this patch will fix it?
> > > > >
> > > > > Because I don't think this patch has any effect.
> > > >
> > > > Yes.
> > > > I really measured leak.
> > > > I can show a result next week.
> > > > #Japan is a holiday until Tuesday.
> > > >
> > > > >
> > > > > send_rexmit_request() is only used as paramter to
> > > > > Gmain_timeout_add_full, and it returns FALSE always,
> > > > > which should cause the respective sourceid to be auto-removed.
> > > >
> > > > It seems to be necessary to release gsource somehow or other.
> > > > The similar liberation seems to be carried out in lrmd.
> > > >
> > > > Best Regards,
> > > > Hideo Yamauchi.
> > > >
> > > >
> > > > --- On Fri, 2012/4/27, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> > > >
> > > > > On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
> > > > > > Hi All,
> > > > > >
> > > > > > We gave test that assumed remote cluster environment.
> > > > > > And we tested packet lost.
> > > > > >
> > > > > > The retransmission timer of Heartbeat causes memory leak.
> > > > > >
> > > > > > I donate a patch.
> > > > > > Please confirm the contents of the patch.
> > > > > > And please reflect a patch in a repository of Heartbeat.
> > > > >
> > > > > Have you actually been able to measure that memory leak you observed,
> > > > > and you can confirm this patch will fix it?
> > > > >
> > > > > Because I don't think this patch has any effect.
> > > > >
> > > > > send_rexmit_request() is only used as paramter to
> > > > > Gmain_timeout_add_full, and it returns FALSE always,
> > > > > which should cause the respective sourceid to be auto-removed.
> > > > >
> > > > >
> > > > > > diff -r 106ca984041b heartbeat/hb_rexmit.c
> > > > > > --- a/heartbeat/hb_rexmit.c    Thu Apr 26 19:28:26 2012 +0900
> > > > > > +++ b/heartbeat/hb_rexmit.c    Thu Apr 26 19:31:44 2012 +0900
> > > > > > @@ -164,6 +164,8 @@
> > > > > >      seqno_t seq = (seqno_t) ri->seq;
> > > > > >      struct node_info* node = ri->node;
> > > > > >      struct ha_msg*    hmsg;
> > > > > > +    unsigned long           sourceid;
> > > > > > +    gpointer value;
> > > > > > 
> > > > > >      if (STRNCMP_CONST(node->status, UPSTATUS) != 0 &&
> > > > > >          STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
> > > > > > @@ -196,11 +198,17 @@
> > > > > >     
> > > > > >      node->track.last_rexmit_req = time_longclock();   
> > > > > >     
> > > > > > -    if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > > > > -        cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > > > > -               "for seq/node(%ld %s)",                
> > > > > > -               __FUNCTION__, ri->seq, ri->node->nodename);
> > > > > > -        return FALSE;
> > > > > > +    value = g_hash_table_lookup(rexmit_hash_table, ri);
> > > > > > +    if ( value != NULL) {
> > > > > > +        sourceid = (unsigned long) value;
> > > > > > +        Gmain_timeout_remove(sourceid);
> > > > > > +
> > > > > > +        if (!g_hash_table_remove(rexmit_hash_table, ri)){
> > > > > > +            cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> > > > > > +                       "for seq/node(%ld %s)",                
> > > > > > +                       __FUNCTION__, ri->seq, ri->node->nodename);
> > > > > > +            return FALSE;
> > > > > > +        }
> > > > > >      }
> > > > > >     
> > > > > >      schedule_rexmit_request(node, seq, max_rexmit_delay);
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
This is very interesting. My apologies for missing this memory leak
:-(. The code logs memory usage periodically exactly to help notice
such a thing.

In my new open source project [http://assimmon.org], I am death on
memory leaks. But I can assure you that back when that code was
written, it was not at all clear who deleted what memory when - when it
came to the glib. I'm not sure if valgrind was out back then, but I
certainly didn't know about it.

I confess that even on this new project I had a heck of a time making
all the glib objects go away. I finally got them cleaned up - but it
took weeks of running under valgrind before I worked out when to do what
to make it throw the objects away - but not crash due to a bad reference.

By the way, I suspect Lars' suggestion would work fine. I would
certainly explain what the "better" patch is in the comments when you
apply this one.


On 05/02/2012 04:57 PM, renayama19661014@ybb.ne.jp wrote:
> Hi Lars,
>
> Thank you for comments.
>
>>> And when it passes more than a full day....
>>>
>>> * node1
>>> 32126 ? SLs 79:52 0 182 71189 24328 0.1 heartbeat: master control process
>>>
>>> * node2
>>> 31928 ? SLs 77:01 0 182 70869 24008 0.1 heartbeat: master control process
>> Oh, I see.
>>
>> This is a "design choice" (maybe not even intentional) of the Gmain_*
>> wrappers used throughout the heartbeat code.
>>
>> The "real" glib g_timeout_add_full(), and most other similar functions,
>> internally do
>> id = g_source_attach(source, ...);
>> g_source_unref(source);
>> return id;
>>
>> Thus in g_main_dispatch, the
>> need_destroy = ! dispatch (...)
>> if (need_destroy)
>> g_source_destroy_internal()
>>
>> in fact ends up destroying it,
>> if dispatch() returns FALSE,
>> as documented:
>> The function is called repeatedly until it returns FALSE, at
>> which point the timeout is automatically destroyed and the
>> function will not be called again.
>>
>> Not so with the heartbeat/glue/libplumbing Gmain_timeout_add_full.
>> It does not g_source_unref(), so we keep the extra reference around
>> until someone explicitly calls Gmain_timeout_remove().
>>
>> Talk about principle of least surprise :(
>>
>> Changing this behaviour to match glib's, i.e. unref'ing after
>> g_source_attach, would seem like the "correct" thing to do,
>> but is likely to break other pieces of code in subtle ways,
>> so it may not be the "right" thing to do at this point.
> Thank you for detailed explanation.
> If you found the method that is appropriate than the correction that I suggested, I approve of it.
>
>> I'm going to take your patch more or less as is.
>> If it does not show up soon, prod me again.
>>
> All right.
>
> Many Thanks!
> Hideo Yamauchi.
>
>
>> Thank you for tracking this down.
>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>> --- On Tue, 2012/5/1, renayama19661014@ybb.ne.jp<renayama19661014@ybb.ne.jp> wrote:
>>>
>>>> Hi Lars,
>>>>
>>>> We confirmed that this problem occurred with v1 mode of Heartbeat.
>>>> * The problem happens with the v2 mode in the same way.
>>>>
>>>> We confirmed a problem in the next procedure.
>>>>
>>>> Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network.
>>>>
>>>> Step 2) Between nodes, the retransmission of the message is carried out repeatedly.
>>>>
>>>> Step 3) Then the memory of the master process increases little by little.
>>>>
>>>>
>>>> -------- As a result of the ps command of the master process ----------
>>>> * node1
>>>> (start)
>>>> 32126 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process
>>>> (One hour later)
>>>> 32126 ? SLs 0:03 0 182 54729 7868 0.0 heartbeat: master control process
>>>> (Two hour later)
>>>> 32126 ? SLs 0:08 0 182 55317 8456 0.0 heartbeat: master control process
>>>> (Four hours later)
>>>> 32126 ? SLs 0:24 0 182 56673 9812 0.0 heartbeat: master control process
>>>>
>>>> * node2
>>>> (start)
>>>> 31928 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process
>>>> (One hour later)
>>>> 31928 ? SLs 0:02 0 182 54481 7620 0.0 heartbeat: master control process
>>>> (Two hour later)
>>>> 31928 ? SLs 0:08 0 182 55353 8492 0.0 heartbeat: master control process
>>>> (Four hours later)
>>>> 31928 ? SLs 0:23 0 182 56689 9828 0.0 heartbeat: master control process
>>>>
>>>>
>>>> The state of the memory leak seems to vary according to a node with the quantity of the retransmission.
>>>>
>>>> The increase of this memory disappears by applying my patch.
>>>>
>>>> And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak.
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
>>>>
>>>>
>>>> --- On Sat, 2012/4/28, renayama19661014@ybb.ne.jp<renayama19661014@ybb.ne.jp> wrote:
>>>>
>>>>> Hi Lars,
>>>>>
>>>>> Thank you for comments.
>>>>>
>>>>>> Have you actually been able to measure that memory leak you observed,
>>>>>> and you can confirm this patch will fix it?
>>>>>>
>>>>>> Because I don't think this patch has any effect.
>>>>> Yes.
>>>>> I really measured leak.
>>>>> I can show a result next week.
>>>>> #Japan is a holiday until Tuesday.
>>>>>
>>>>>> send_rexmit_request() is only used as paramter to
>>>>>> Gmain_timeout_add_full, and it returns FALSE always,
>>>>>> which should cause the respective sourceid to be auto-removed.
>>>>> It seems to be necessary to release gsource somehow or other.
>>>>> The similar liberation seems to be carried out in lrmd.
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>>
>>>>>
>>>>> --- On Fri, 2012/4/27, Lars Ellenberg<lars.ellenberg@linbit.com> wrote:
>>>>>
>>>>>> On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We gave test that assumed remote cluster environment.
>>>>>>> And we tested packet lost.
>>>>>>>
>>>>>>> The retransmission timer of Heartbeat causes memory leak.
>>>>>>>
>>>>>>> I donate a patch.
>>>>>>> Please confirm the contents of the patch.
>>>>>>> And please reflect a patch in a repository of Heartbeat.
>>>>>> Have you actually been able to measure that memory leak you observed,
>>>>>> and you can confirm this patch will fix it?
>>>>>>
>>>>>> Because I don't think this patch has any effect.
>>>>>>
>>>>>> send_rexmit_request() is only used as paramter to
>>>>>> Gmain_timeout_add_full, and it returns FALSE always,
>>>>>> which should cause the respective sourceid to be auto-removed.
>>>>>>
>>>>>>
>>>>>>> diff -r 106ca984041b heartbeat/hb_rexmit.c
>>>>>>> --- a/heartbeat/hb_rexmit.c Thu Apr 26 19:28:26 2012 +0900
>>>>>>> +++ b/heartbeat/hb_rexmit.c Thu Apr 26 19:31:44 2012 +0900
>>>>>>> @@ -164,6 +164,8 @@
>>>>>>> seqno_t seq = (seqno_t) ri->seq;
>>>>>>> struct node_info* node = ri->node;
>>>>>>> struct ha_msg* hmsg;
>>>>>>> + unsigned long sourceid;
>>>>>>> + gpointer value;
>>>>>>>
>>>>>>> if (STRNCMP_CONST(node->status, UPSTATUS) != 0&&
>>>>>>> STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
>>>>>>> @@ -196,11 +198,17 @@
>>>>>>>
>>>>>>> node->track.last_rexmit_req = time_longclock();
>>>>>>>
>>>>>>> - if (!g_hash_table_remove(rexmit_hash_table, ri)){
>>>>>>> - cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
>>>>>>> - "for seq/node(%ld %s)",
>>>>>>> - __FUNCTION__, ri->seq, ri->node->nodename);
>>>>>>> - return FALSE;
>>>>>>> + value = g_hash_table_lookup(rexmit_hash_table, ri);
>>>>>>> + if ( value != NULL) {
>>>>>>> + sourceid = (unsigned long) value;
>>>>>>> + Gmain_timeout_remove(sourceid);
>>>>>>> +
>>>>>>> + if (!g_hash_table_remove(rexmit_hash_table, ri)){
>>>>>>> + cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
>>>>>>> + "for seq/node(%ld %s)",
>>>>>>> + __FUNCTION__, ri->seq, ri->node->nodename);
>>>>>>> + return FALSE;
>>>>>>> + }
>>>>>>> }
>>>>>>>
>>>>>>> schedule_rexmit_request(node, seq, max_rexmit_delay);
>> --
>> : Lars Ellenberg
>> : LINBIT | Your Way to High Availability
>> : DRBD/HA support and consulting http://www.linbit.com
>>
>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>> _______________________________________________________
>> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>> Home Page: http://linux-ha.org/
>>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/


--
Alan Robertson<alanr@unix.sh> - @OSSAlanR

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
FYI: there is code in the heartbeat communication layer which is quite
happy to simulate lost packets.

I made it difficult to turn on accidentally. Read the code for details
if you're interested.



On 04/30/2012 10:21 PM, renayama19661014@ybb.ne.jp wrote:
> Hi Lars,
>
> We confirmed that this problem occurred with v1 mode of Heartbeat.
> * The problem happens with the v2 mode in the same way.
>
> We confirmed a problem in the next procedure.
>
> Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network.
>
> Step 2) Between nodes, the retransmission of the message is carried out repeatedly.
>
> Step 3) Then the memory of the master process increases little by little.
>
>
> -------- As a result of the ps command of the master process ----------
> * node1
> (start)
> 32126 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process
> (One hour later)
> 32126 ? SLs 0:03 0 182 54729 7868 0.0 heartbeat: master control process
> (Two hour later)
> 32126 ? SLs 0:08 0 182 55317 8456 0.0 heartbeat: master control process
> (Four hours later)
> 32126 ? SLs 0:24 0 182 56673 9812 0.0 heartbeat: master control process
>
> * node2
> (start)
> 31928 ? SLs 0:00 0 182 53989 7128 0.0 heartbeat: master control process
> (One hour later)
> 31928 ? SLs 0:02 0 182 54481 7620 0.0 heartbeat: master control process
> (Two hour later)
> 31928 ? SLs 0:08 0 182 55353 8492 0.0 heartbeat: master control process
> (Four hours later)
> 31928 ? SLs 0:23 0 182 56689 9828 0.0 heartbeat: master control process
>
>
> The state of the memory leak seems to vary according to a node with the quantity of the retransmission.
>
> The increase of this memory disappears by applying my patch.
>
> And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak.
>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- On Sat, 2012/4/28, renayama19661014@ybb.ne.jp<renayama19661014@ybb.ne.jp> wrote:
>
>> Hi Lars,
>>
>> Thank you for comments.
>>
>>> Have you actually been able to measure that memory leak you observed,
>>> and you can confirm this patch will fix it?
>>>
>>> Because I don't think this patch has any effect.
>> Yes.
>> I really measured leak.
>> I can show a result next week.
>> #Japan is a holiday until Tuesday.
>>
>>> send_rexmit_request() is only used as paramter to
>>> Gmain_timeout_add_full, and it returns FALSE always,
>>> which should cause the respective sourceid to be auto-removed.
>> It seems to be necessary to release gsource somehow or other.
>> The similar liberation seems to be carried out in lrmd.
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>> --- On Fri, 2012/4/27, Lars Ellenberg<lars.ellenberg@linbit.com> wrote:
>>
>>> On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
>>>> Hi All,
>>>>
>>>> We gave test that assumed remote cluster environment.
>>>> And we tested packet lost.
>>>>
>>>> The retransmission timer of Heartbeat causes memory leak.
>>>>
>>>> I donate a patch.
>>>> Please confirm the contents of the patch.
>>>> And please reflect a patch in a repository of Heartbeat.
>>> Have you actually been able to measure that memory leak you observed,
>>> and you can confirm this patch will fix it?
>>>
>>> Because I don't think this patch has any effect.
>>>
>>> send_rexmit_request() is only used as paramter to
>>> Gmain_timeout_add_full, and it returns FALSE always,
>>> which should cause the respective sourceid to be auto-removed.
>>>
>>>
>>>> diff -r 106ca984041b heartbeat/hb_rexmit.c
>>>> --- a/heartbeat/hb_rexmit.c Thu Apr 26 19:28:26 2012 +0900
>>>> +++ b/heartbeat/hb_rexmit.c Thu Apr 26 19:31:44 2012 +0900
>>>> @@ -164,6 +164,8 @@
>>>> seqno_t seq = (seqno_t) ri->seq;
>>>> struct node_info* node = ri->node;
>>>> struct ha_msg* hmsg;
>>>> + unsigned long sourceid;
>>>> + gpointer value;
>>>>
>>>> if (STRNCMP_CONST(node->status, UPSTATUS) != 0&&
>>>> STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
>>>> @@ -196,11 +198,17 @@
>>>>
>>>> node->track.last_rexmit_req = time_longclock();
>>>>
>>>> - if (!g_hash_table_remove(rexmit_hash_table, ri)){
>>>> - cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
>>>> - "for seq/node(%ld %s)",
>>>> - __FUNCTION__, ri->seq, ri->node->nodename);
>>>> - return FALSE;
>>>> + value = g_hash_table_lookup(rexmit_hash_table, ri);
>>>> + if ( value != NULL) {
>>>> + sourceid = (unsigned long) value;
>>>> + Gmain_timeout_remove(sourceid);
>>>> +
>>>> + if (!g_hash_table_remove(rexmit_hash_table, ri)){
>>>> + cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
>>>> + "for seq/node(%ld %s)",
>>>> + __FUNCTION__, ri->seq, ri->node->nodename);
>>>> + return FALSE;
>>>> + }
>>>> }
>>>>
>>>> schedule_rexmit_request(node, seq, max_rexmit_delay);
>>>
>>> --
>>> : Lars Ellenberg
>>> : LINBIT | Your Way to High Availability
>>> : DRBD/HA support and consulting http://www.linbit.com
>>>
>>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>>> _______________________________________________________
>>> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>>> Home Page: http://linux-ha.org/
>>>
>> _______________________________________________________
>> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>> Home Page: http://linux-ha.org/
>>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/


--
Alan Robertson<alanr@unix.sh> - @OSSAlanR

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
Hi Alan,

Thank you for comments.

> FYI: there is code in the heartbeat communication layer which is quite happy to simulate lost packets.
>
> I made it difficult to turn on accidentally. Read the code for details if you're interested.

All right.

Many Thanks,
Hideo Yamauchi.

--- On Tue, 2012/5/8, Alan Robertson <alanr@unix.sh> wrote:

> FYI: there is code in the heartbeat communication layer which is quite happy to simulate lost packets.
>
> I made it difficult to turn on accidentally.  Read the code for details if you're interested.
>
>
>
> On 04/30/2012 10:21 PM, renayama19661014@ybb.ne.jp wrote:
> > Hi Lars,
> >
> > We confirmed that this problem occurred with v1 mode of Heartbeat.
> >   * The problem happens with the v2 mode in the same way.
> >
> > We confirmed a problem in the next procedure.
> >
> > Step 1) Put a special device extinguishing a communication packet of Heartbeat in the network.
> >
> > Step 2) Between nodes, the retransmission of the message is carried out repeatedly.
> >
> > Step 3) Then the memory of the master process increases little by little.
> >
> >
> > -------- As a result of the ps command of the master process ----------
> > * node1
> > (start)
> > 32126 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
> > (One hour later)
> > 32126 ?        SLs    0:03      0   182 54729  7868  0.0 heartbeat: master control process
> > (Two hour later)
> > 32126 ?        SLs    0:08      0   182 55317  8456  0.0 heartbeat: master control process
> > (Four hours later)
> > 32126 ?        SLs    0:24      0   182 56673  9812  0.0 heartbeat: master control process
> >
> > * node2
> > (start)
> > 31928 ?        SLs    0:00      0   182 53989  7128  0.0 heartbeat: master control process
> > (One hour later)
> > 31928 ?        SLs    0:02      0   182 54481  7620  0.0 heartbeat: master control process
> > (Two hour later)
> > 31928 ?        SLs    0:08      0   182 55353  8492  0.0 heartbeat: master control process
> > (Four hours later)
> > 31928 ?        SLs    0:23      0   182 56689  9828  0.0 heartbeat: master control process
> >
> >
> > The state of the memory leak seems to vary according to a node with the quantity of the retransmission.
> >
> > The increase of this memory disappears by applying my patch.
> >
> > And the similar correspondence seems to be necessary in send_reqnodes_msg(), but this is like little leak.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> >
> > --- On Sat, 2012/4/28, renayama19661014@ybb.ne.jp<renayama19661014@ybb.ne.jp>  wrote:
> >
> >> Hi Lars,
> >>
> >> Thank you for comments.
> >>
> >>> Have you actually been able to measure that memory leak you observed,
> >>> and you can confirm this patch will fix it?
> >>>
> >>> Because I don't think this patch has any effect.
> >> Yes.
> >> I really measured leak.
> >> I can show a result next week.
> >> #Japan is a holiday until Tuesday.
> >>
> >>> send_rexmit_request() is only used as paramter to
> >>> Gmain_timeout_add_full, and it returns FALSE always,
> >>> which should cause the respective sourceid to be auto-removed.
> >> It seems to be necessary to release gsource somehow or other.
> >> The similar liberation seems to be carried out in lrmd.
> >>
> >> Best Regards,
> >> Hideo Yamauchi.
> >>
> >>
> >> --- On Fri, 2012/4/27, Lars Ellenberg<lars.ellenberg@linbit.com>  wrote:
> >>
> >>> On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661014@ybb.ne.jp wrote:
> >>>> Hi All,
> >>>>
> >>>> We gave test that assumed remote cluster environment.
> >>>> And we tested packet lost.
> >>>>
> >>>> The retransmission timer of Heartbeat causes memory leak.
> >>>>
> >>>> I donate a patch.
> >>>> Please confirm the contents of the patch.
> >>>> And please reflect a patch in a repository of Heartbeat.
> >>> Have you actually been able to measure that memory leak you observed,
> >>> and you can confirm this patch will fix it?
> >>>
> >>> Because I don't think this patch has any effect.
> >>>
> >>> send_rexmit_request() is only used as paramter to
> >>> Gmain_timeout_add_full, and it returns FALSE always,
> >>> which should cause the respective sourceid to be auto-removed.
> >>>
> >>>
> >>>> diff -r 106ca984041b heartbeat/hb_rexmit.c
> >>>> --- a/heartbeat/hb_rexmit.c    Thu Apr 26 19:28:26 2012 +0900
> >>>> +++ b/heartbeat/hb_rexmit.c    Thu Apr 26 19:31:44 2012 +0900
> >>>> @@ -164,6 +164,8 @@
> >>>>        seqno_t seq = (seqno_t) ri->seq;
> >>>>        struct node_info* node = ri->node;
> >>>>        struct ha_msg*    hmsg;
> >>>> +    unsigned long           sourceid;
> >>>> +    gpointer value;
> >>>>          if (STRNCMP_CONST(node->status, UPSTATUS) != 0&&
> >>>>            STRNCMP_CONST(node->status, ACTIVESTATUS) !=0) {
> >>>> @@ -196,11 +198,17 @@
> >>>>              node->track.last_rexmit_req = time_longclock();         -    if (!g_hash_table_remove(rexmit_hash_table, ri)){
> >>>> -        cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> >>>> -               "for seq/node(%ld %s)",                -               __FUNCTION__, ri->seq, ri->node->nodename);
> >>>> -        return FALSE;
> >>>> +    value = g_hash_table_lookup(rexmit_hash_table, ri);
> >>>> +    if ( value != NULL) {
> >>>> +        sourceid = (unsigned long) value;
> >>>> +        Gmain_timeout_remove(sourceid);
> >>>> +
> >>>> +        if (!g_hash_table_remove(rexmit_hash_table, ri)){
> >>>> +            cl_log(LOG_ERR, "%s: entry not found in rexmit_hash_table"
> >>>> +                       "for seq/node(%ld %s)",                +                       __FUNCTION__, ri->seq, ri->node->nodename);
> >>>> +            return FALSE;
> >>>> +        }
> >>>>        }
> >>>>              schedule_rexmit_request(node, seq, max_rexmit_delay);
> >>>
> >>> -- : Lars Ellenberg
> >>> : LINBIT | Your Way to High Availability
> >>> : DRBD/HA support and consulting http://www.linbit.com
> >>>
> >>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> >>> _______________________________________________________
> >>> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> >>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> >>> Home Page: http://linux-ha.org/
> >>>
> >> _______________________________________________________
> >> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> >> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> >> Home Page: http://linux-ha.org/
> >>
> > _______________________________________________________
> > Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > Home Page: http://linux-ha.org/
>
>
> --     Alan Robertson<alanr@unix.sh>  - @OSSAlanR
>
> "Openness is the foundation and preservative of friendship...  Let me claim from you at all times your undisguised opinions." - William Wilberforce
>
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
On Wed, May 02, 2012 at 09:00:35PM -0600, Alan Robertson wrote:
> This is very interesting. My apologies for missing this memory leak
> :-(. The code logs memory usage periodically exactly to help notice
> such a thing.
>
> In my new open source project [http://assimmon.org], I am death on
> memory leaks. But I can assure you that back when that code was
> written, it was not at all clear who deleted what memory when - when it
> came to the glib. I'm not sure if valgrind was out back then, but I
> certainly didn't know about it.
>
> I confess that even on this new project I had a heck of a time making
> all the glib objects go away. I finally got them cleaned up - but it
> took weeks of running under valgrind before I worked out when to do what
> to make it throw the objects away - but not crash due to a bad reference.
>
> By the way, I suspect Lars' suggestion would work fine. I would
> certainly explain what the "better" patch is in the comments when you
> apply this one.

My suggestion would be:


diff --git a/lib/clplumbing/GSource.c b/lib/clplumbing/GSource.c
--- a/lib/clplumbing/GSource.c
+++ b/lib/clplumbing/GSource.c
@@ -1506,6 +1506,7 @@ Gmain_timeout_add_full(gint priority
g_source_set_callback(source, function, data, notify);

append->gsourceid = g_source_attach(source, NULL);
+ g_source_unref(source);
return append->gsourceid;

}


Then recompile and reinstall libglue.
Please see if that also fixes the leak for you
(and does not explode).

Note: that is *instead* of your patch,
not in addition to.

Thanks,
Lars

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
On Mon, May 14, 2012 at 05:44:55PM +0200, Lars Ellenberg wrote:
> > By the way, I suspect Lars' suggestion would work fine. I would
> > certainly explain what the "better" patch is in the comments when you
> > apply this one.
>
> My suggestion would be:
>
>
> diff --git a/lib/clplumbing/GSource.c b/lib/clplumbing/GSource.c
> --- a/lib/clplumbing/GSource.c
> +++ b/lib/clplumbing/GSource.c
> @@ -1506,6 +1506,7 @@ Gmain_timeout_add_full(gint priority
> g_source_set_callback(source, function, data, notify);
>
> append->gsourceid = g_source_attach(source, NULL);
> + g_source_unref(source);
> return append->gsourceid;
>
> }
>
>
> Then recompile and reinstall libglue.
> Please see if that also fixes the leak for you
> (and does not explode).

Hm. Looks like it *does* explode (aka segfault)
on various occasions throughout at least heartbeat core and lrmd,
probably when a timeout callback tries to remove itself
explicitly before returning FALSE.

Guess we need to stay bug-to-bug compatible here,
unless the users of Gmain_timeout_* from clplumbing
turn out to be less than a handful and can be audited.
Which may well be the case.

Though it would break the ABI, so it would be a lot of effort,
and I frankly doubt this is worth it for clplumbing
at this point in its lifecycle.

:-(

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
On Tue, May 15, 2012 at 11:14:53AM +0200, Lars Ellenberg wrote:
> On Mon, May 14, 2012 at 05:44:55PM +0200, Lars Ellenberg wrote:
> > > By the way, I suspect Lars' suggestion would work fine. I would
> > > certainly explain what the "better" patch is in the comments when you
> > > apply this one.

> Hm. Looks like it *does* explode (aka segfault)

Continuing my monologue ...
it may just have been incomplete.

The patch below seems to work just fine.

I managed to occasionally trigger the
"Attempt to remove timeout (%u) with NULL source"
message, but I have seen that one without the patch as well,
so that may just be some other oddity somewhere:
double removal of timeout resources ;-)

We can find and drop those later,
they look harmless enough.

I do not see any memleak anywhere anymore with this patch applied.

Comments/review/testing welcome.

# HG changeset patch
# User Lars Ellenberg <lars@linbit.com>
# Date 1337066453 -7200
# Node ID e63dd41f46b7bd150a23a62303bde6be78305c9c
# Parent 63d968249025b245e38b1da6d0202438ec45ebf3
[mq]: potential-fix-for-timer-leak

diff --git a/lib/clplumbing/GSource.c b/lib/clplumbing/GSource.c
--- a/lib/clplumbing/GSource.c
+++ b/lib/clplumbing/GSource.c
@@ -1507,6 +1507,7 @@
g_source_set_callback(source, function, data, notify);

append->gsourceid = g_source_attach(source, NULL);
+ g_source_unref(source);
return append->gsourceid;

}
@@ -1517,14 +1518,12 @@
GSource* source = g_main_context_find_source_by_id(NULL,tag);
struct GTimeoutAppend* append = GTIMEOUT(source);

- g_source_remove(tag);
-
if (source == NULL){
cl_log(LOG_ERR, "Attempt to remove timeout (%u)"
" with NULL source", tag);
}else{
g_assert(IS_TIMEOUTSRC(append));
- g_source_unref(source);
+ g_source_remove(tag);
}

return;

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
Hi Lars,

Sorry...An answer was late.

In the environment where we confirmed leak, I confirm your patch.

Many Thanks,
Hideo Yamauchi.

--- On Wed, 2012/5/16, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> On Tue, May 15, 2012 at 11:14:53AM +0200, Lars Ellenberg wrote:
> > On Mon, May 14, 2012 at 05:44:55PM +0200, Lars Ellenberg wrote:
> > > > By the way, I suspect Lars' suggestion would work fine.  I would
> > > > certainly explain what the "better" patch is in the comments when you
> > > > apply this one.
>
> > Hm. Looks like it *does* explode (aka segfault)
>
> Continuing my monologue ...
> it may just have been incomplete.
>
> The patch below seems to work just fine.
>
> I managed to occasionally trigger the
>     "Attempt to remove timeout (%u) with NULL source"
> message, but I have seen that one without the patch as well,
> so that may just be some other oddity somewhere:
> double removal of timeout resources ;-)
>
> We can find and drop those later,
> they look harmless enough.
>
> I do not see any memleak anywhere anymore with this patch applied.
>
> Comments/review/testing welcome.
>
> # HG changeset patch
> # User Lars Ellenberg <lars@linbit.com>
> # Date 1337066453 -7200
> # Node ID e63dd41f46b7bd150a23a62303bde6be78305c9c
> # Parent  63d968249025b245e38b1da6d0202438ec45ebf3
> [mq]: potential-fix-for-timer-leak
>
> diff --git a/lib/clplumbing/GSource.c b/lib/clplumbing/GSource.c
> --- a/lib/clplumbing/GSource.c
> +++ b/lib/clplumbing/GSource.c
> @@ -1507,6 +1507,7 @@
>     g_source_set_callback(source, function, data, notify);
>
>     append->gsourceid = g_source_attach(source, NULL);
> +    g_source_unref(source);
>     return append->gsourceid;
>
> }
> @@ -1517,14 +1518,12 @@
>     GSource* source = g_main_context_find_source_by_id(NULL,tag);
>     struct GTimeoutAppend* append = GTIMEOUT(source);
>    
> -    g_source_remove(tag);
> -   
>     if (source == NULL){
>         cl_log(LOG_ERR, "Attempt to remove timeout (%u)"
>         " with NULL source",    tag);
>     }else{
>         g_assert(IS_TIMEOUTSRC(append));
> -        g_source_unref(source);
> +        g_source_remove(tag);
>     }
>    
>     return;
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
On Wed, May 16, 2012 at 09:33:48AM +0900, renayama19661014@ybb.ne.jp wrote:
> Hi Lars,
>
> In the environment where we confirmed leak, I confirm your patch.

Pushed to http://hg.linux-ha.org/glue

Thanks,

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Patch] The patch which revises memory leak. [ In reply to ]
Hi Lars,

> Pushed to http://hg.linux-ha.org/glue

We confirmed that a problem was settled with your patch.

Many Thanks!
Hideo Yamauchi.

--- On Thu, 2012/5/17, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> On Wed, May 16, 2012 at 09:33:48AM +0900, renayama19661014@ybb.ne.jp wrote:
> > Hi Lars,
> >
> > In the environment where we confirmed leak, I confirm your patch.
>
> Pushed to http://hg.linux-ha.org/glue
>
> Thanks,
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/