Mailing List Archive: Patches: RFC before pull request

Andrew,
All,

Please have a look at the patches I queued up here:
https://github.com/lge/pacemaker/commits/for-beekhof

Most (not all) are specific to the heartbeat cluster stack.

Thanks,
Lars

A few comments here:

-----

This effectively changes crm_mon output,
but also changes logging where this method is invoked:

Low: native_print: report target-role as well

This is for the "Why does my resource not start?" guys who
forgot to remove the limiting target-role setting.

Report target role (unless "Started", which is the default anyways),
if it limits our abilities (Slave, Stopped),
or if it differs from the current status.
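
To illustrate the rule above, here is a minimal sketch of the reporting decision. The enum and function names are hypothetical and do not come from the patch, and treating an unset target-role like "Started" is an assumption.

```c
#include <stdbool.h>

/* Hypothetical role values for illustration only; not Pacemaker's own enum. */
enum demo_role {
    DEMO_ROLE_UNSET,
    DEMO_ROLE_STOPPED,
    DEMO_ROLE_STARTED,
    DEMO_ROLE_SLAVE,
    DEMO_ROLE_MASTER
};

/* Decide whether the target role is worth printing next to the resource. */
static bool demo_report_target_role(enum demo_role target, enum demo_role current)
{
    /* "Started" is the default, so it carries no information.
     * Assumption: an unset target-role is treated the same way. */
    if (target == DEMO_ROLE_UNSET || target == DEMO_ROLE_STARTED) {
        return false;
    }
    /* Roles that limit what the resource may do are always reported. */
    if (target == DEMO_ROLE_SLAVE || target == DEMO_ROLE_STOPPED) {
        return true;
    }
    /* Anything else is reported only when it differs from the current status. */
    return target != current;
}
```

With this sketch, a resource with target-role "Stopped" is always flagged, while target-role "Master" is flagged only as long as the resource has not reached that role.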

-----

Heartbeat specific:

Low: allow heartbeat to spawn the pengine itself, and tell crmd about it

Heartbeat 3.0.6 now may spawn the pengine directly, and will announce
this in the environment -- I introduced the setting "crmd_spawns_pengine".

This improves shutdown behavior. Otherwise I regularly find an orphaned
pengine process after pacemaker shutdown.

-----

Heartbeat specific, as a consequence of the fix below:

Low: add debugging aid to help spot missing set_msg_callback()s on heartbeat

In ha_msg_dispatch(), change from rcvmsg() to readmsg().
rcvmsg() is internally simply a wrapper around readmsg(),
which silently deletes messages without matching callback.

Use readmsg() directly here. It will only return unprocessed (by
callbacks) messages, so log a warning, notice or debug message
depending on message header information, and ha_msg_del() it ourselves.
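
The idea can be modelled without the heartbeat API. In the sketch below every name is a hypothetical stand-in (none of them is heartbeat's real interface): a read call runs the matching callback if there is one and otherwise hands the message back, so the dispatch loop can account for it instead of dropping it silently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; this is not the heartbeat API. */
struct demo_msg {
    const char *type;
};

typedef void (*demo_cb)(const struct demo_msg *msg);

struct demo_cb_entry {
    const char *type;
    demo_cb cb;
};

static int demo_handled = 0;

/* Example callback that only counts how often it was invoked. */
static void demo_count_cb(const struct demo_msg *msg)
{
    (void)msg;
    demo_handled++;
}

/* Run the matching callback and return NULL, or return the message
 * to the caller if no callback is registered for its type. */
static const struct demo_msg *demo_read(const struct demo_cb_entry *table,
                                        size_t n_entries,
                                        const struct demo_msg *msg)
{
    for (size_t i = 0; i < n_entries; i++) {
        if (strcmp(table[i].type, msg->type) == 0) {
            table[i].cb(msg);
            return NULL;
        }
    }
    return msg;
}

/* Dispatch loop: count the messages nobody registered a callback for.
 * Real code would log each one here and then free it itself. */
static size_t demo_dispatch(const struct demo_cb_entry *table, size_t n_entries,
                            const struct demo_msg *msgs, size_t n_msgs)
{
    size_t unhandled = 0;

    for (size_t i = 0; i < n_msgs; i++) {
        if (demo_read(table, n_entries, &msgs[i]) != NULL) {
            unhandled++;
        }
    }
    return unhandled;
}
```

A non-zero return from demo_dispatch() corresponds to the log lines the debugging aid produces: each one points at a message type that still lacks a registered callback.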

-----

Heartbeat specific bug fix:

High: fix stonith ignoring its own messages on heartbeat

Since the introduction of the additional F_TYPE messages
T_STONITH_NOTIFY and T_STONITH_TIMEOUT_VALUE, and their use as message
types in global heartbeat cluster messages, stonith-ng was broken on the
heartbeat cluster stack.

When delegation was made the default, and the result could only be
reaped by listening for the T_STONITH_NOTIFY message, no-one (but
stonithd itself) would ever notice successful completion,
and stonith would be re-issued forever.

Registering callbacks for these F_TYPE values fixes the hung stonith and
stonith_admin operations on the heartbeat cluster stack.

-----

Heartbeat specific:

Medium: fix tracking of peer client process status on heartbeat

Don't optimistically assume that peer client processes are alive,
or that a node that can talk to us is in fact member of the same
ccm partition.

Whenever ccm tells us about a new membership, *ask* for peer client
process status.

-----

This oneliner may well be relevant for corosync CPG as well,
possibly one of the reasons the pcmk_cpg_membership() has this funny
"appears to be online even though we think it is dead" block?

fix crm_update_peer_proc to NOT ignore flags if partially set

The "set_bit()" function used here actually deals with masks, not bit numbers.
The "flag" argument should in fact be plural: flags.

These proc flag bits are not always set one at a time,
but for example as "crm_proc_crmd | crm_proc_cpg",
and not necessarily cleared with the same combination.

Ignoring to-be-set flags just because *some* of the flag bits are
already set is clearly a bug, and may be the reason for stale process
cache information.
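
A minimal sketch of the difference, not the actual crm_update_peer_proc() code: the function names and flag values below are made up for illustration. The buggy variant treats "any requested bit already set" as "nothing to do"; the fixed variant skips the update only when all requested bits are already set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; not Pacemaker's definitions. */
#define DEMO_PROC_CRMD 0x1u
#define DEMO_PROC_CPG  0x2u

/* Buggy: skips the update as soon as ANY of the requested bits is set,
 * so the remaining requested bits are never recorded. */
static bool demo_set_flags_buggy(uint32_t *word, uint32_t flags)
{
    if ((*word & flags) != 0) {
        return false;           /* reported as "no change" */
    }
    *word |= flags;
    return true;
}

/* Fixed: skips the update only if ALL requested bits are already set. */
static bool demo_set_flags_fixed(uint32_t *word, uint32_t flags)
{
    if ((*word & flags) == flags) {
        return false;           /* genuinely no change */
    }
    *word |= flags;
    return true;
}
```

Starting from a word that has only DEMO_PROC_CRMD set, a request for DEMO_PROC_CRMD | DEMO_PROC_CPG leaves the word untouched in the buggy variant, while the fixed variant adds the missing bit and reports a change.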

-----

Heartbeat specific:

Medium: map heartbeat JOIN/LEAVE status to ONLINE/OFFLINE

The rest of the code deals in "online" and "offline",
not "join" and "leave". Need to map these states,
or the rest of the code won't work properly.
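
As a rough sketch of such a mapping (the function name is hypothetical, and the literal strings are assumed to match the status names quoted above; passing unknown values through unchanged is also an assumption, not necessarily what the patch does):

```c
#include <stddef.h>
#include <string.h>

/* Map heartbeat-style client status strings onto the online/offline
 * vocabulary. The string literals are assumptions based on the names
 * used in the text above. */
static const char *demo_map_hb_status(const char *status)
{
    if (status == NULL) {
        return NULL;
    }
    if (strcmp(status, "join") == 0) {
        return "online";
    }
    if (strcmp(status, "leave") == 0) {
        return "offline";
    }
    /* Assumption: anything else is passed through unchanged. */
    return status;
}
```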

-----

Generic: if shutdown is requested before the stonith connection was ever established
(due to other problems), insisting on retrying the stonith connection confused the shutdown.

Medium: don't trigger a stonith_reconnect if no longer required

Get rid of some spurious error messages, and speed up shutdown,
even if the connection to the stonith daemon failed.

-----

Non-functional change, just for readability:

Low: use CRM_NODE_MEMBER, not CRM_NODE_ACTIVE

ACTIVE is defined to be MEMBER anyways:
include/crm/cluster.h:#define CRM_NODE_ACTIVE CRM_NODE_MEMBER

Don't confuse the reader of the code
by implying it was something different.

-----

Heartbeat specific, packaging only:

Low: heartbeat 3.0.6 knows how to find the daemons; drop compat symlinks

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

They all look sane to me. Please proceed with a pull request :-)

We should probably start thinking about .13 (or .14 for the superstitious); quite a few important patches have arrived since .12 was released.

> On 10 Dec 2014, at 1:33 am, Lars Ellenberg <Lars.Ellenberg@linbit.com> wrote:
> [...]
