Mailing List Archive

why i don't like heartbeat
On this mailing list, I have criticized heartbeat several times without
much result, so I am sending this text. It is a summary for archive
purposes. The subject has already been discussed in the past, so I don't
see the need to debate it again.

In this text, I explain why I don't like heartbeat. Nevertheless, I
appear to be the only one who holds this opinion, so a consensus
has been reached that heartbeat is a 'good thing'.
I respect this consensus, even if I disagree.

Further, even if I don't like his program, Alan Robertson still
calmly explains his points of view, and I appreciate that.

In short, what I don't like in heartbeat:

1. Heartbeat tries to design/rewrite a network stack. I claim that IP
already does the job. IP has been designed/implemented by experienced
people; I think it would be a mistake not to use it.
Moreover, rewriting it would be a HUGE job, and very long to debug.

2. In RFC 1925 we can find:
" (5) It is always possible to aglutenate multiple separate problems
into a single complex interdependent solution. In most cases
this is a bad idea."
I strongly believe in that; the Occam's razor concept is close too.
I see heartbeat as the exact opposite: heartbeat tries to include
everything in it without reason.
I remember a sentence whose author I have unfortunately forgotten:
'there are 2 ways to design a protocol: to make it so simple that there
are obviously no errors, or to make it so complex that there are no
obvious errors'. I think heartbeat falls in the second category and I
would like it to fall in the first.

In my opinion, a program called heartbeat should monitor the neighbors'
life and report the result, no more. Others will take decisions based
on the data reported.
IF rewriting a network stack is needed (and not using IP), IF rewriting
a serial protocol (and not using slip/ppp) is needed, IF a reliable
multicast protocol is needed (which I question, especially for a life
monitor), IF rewriting one is needed (and not using one of the ten
existing ones http://www.tascnets.com/mist/doc/mcpCompare.html),
heartbeat isn't the right place to implement it.
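
To make the "monitor and report, no more" idea concrete, here is a
minimal sketch in Python. It is not heartbeat's code: the class name,
the method names and the 10 second timeout are all assumptions chosen
for illustration. The monitor only records when each neighbor was last
heard from and reports a status; it takes no action itself.

```python
import time


class LivenessMonitor:
    """Tracks when each neighbor was last heard from; takes no action itself."""

    def __init__(self, dead_after=10.0, clock=time.monotonic):
        self.dead_after = dead_after  # seconds of silence before "dead" (assumed value)
        self.clock = clock            # injectable clock, so the sketch can be tested
        self.last_seen = {}           # node name -> time of the last beat received

    def beat(self, node):
        """Record that a heartbeat arrived from `node`."""
        self.last_seen[node] = self.clock()

    def report(self):
        """Return {node: 'alive' | 'dead'}; acting on it is the caller's job."""
        now = self.clock()
        return {
            node: "alive" if now - seen <= self.dead_after else "dead"
            for node, seen in self.last_seen.items()
        }
```

A separate program (a cluster manager, for instance) would call
report() and decide what to do with the result.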

3. It isn't documented, so except for the people who have the time to
study the code (I tried), nobody will review the custom security (btw it
is commonly accepted that unreviewed security should be assumed
untrustworthy), the custom network protocols, etc. If there is
a mistake in it, the longer we wait the harder it will be to fix.
This is undesirable in general but really harmful in an HA environment.

What I like in heartbeat:
1. it exists and so complies with Linus's rule 'show me the sources'

I hope I made my opinions clear. I will stop repeating my criticisms
of heartbeat until I have source code to replace it. I have
no time to write any now, so enjoy heartbeat :)
why i don't like heartbeat [ In reply to ]
I may regret stirring the controversy here, but I think Mr Etienne
raises some very valid points that deserve serious consideration.

On Thu, May 11, 2000 at 02:49:24PM -0400, Jerome Etienne wrote:
> On this mailing list, i critized several time heartbeat without much
> result, so i send this text. It is a summary for archive purpose.

Thanks for the clear exposition of your views.

> In short, what i don't like in heartbeat:
>
> 1. Heartbeat try to design/rewrite a network stack. i claim that IP already
> does the job. IP has been designed/implemented by experienced people,

I somewhat agree with this, except that the heartbeat protocols (or some
variation) over serial loops or other alternate connections provide a
valuable addition and complement to IP only.

Of course a node that cannot speak IP is not very useful, but it is probably
helpful to be able to distinguish connectivity failures from node failures.


> 2. ...
> " (5) It is always possible to aglutenate multiple separate problems
> into a single complex interdependent solution. In most cases
> this is a bad idea."

Nice quote. ;-)

> In my opinion, a program called heartbeat should monitor the neighbors
> life and report the result, no more. Others will take decisions based
> on the data reported.

This seems desirable from a modularity point of view. See below.

> What i like in heartbeat:
> 1. it exists and so complies to the linus's rule 'show me the sources'

Indeed.

I know that heartbeat is intended to be modular and is a work in
progress, but many other packages are starting to depend on specific details
of heartbeat and its config/action scripts. This not only makes it hard for
alternatives to heartbeat to arise, it also makes it hard to evolve
heartbeat itself.

Also, perhaps heartbeat is too successful. Sometimes I wonder if it is
providing just good enough cluster manager functionality to actually inhibit
the development of the cluster manager itself. If so, it might be worth
considering ruthlessly refactoring it into a separate newborn cluster
manager component and a well developed heartbeat-only component. This
may also be a useful exercise to prepare for the eventual arrival of
Failsafe.

These are some of my thoughts after reading Mr. Etienne's comments and are
not meant to restart any argument in particular, just to promote quiet
contemplation of the alternatives.

Thanks for your patience.

-dg

--
David Gould dgould@suse.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
As long as each individual is facing the TV tube alone, formal
freedom poses no threat to privilege. --Noam Chomsky
why i don't like heartbeat [ In reply to ]
On 2000-05-11T14:17:22,
dgould@suse.com said:

> I somewhat agree with this, except that the heartbeat protocols (or some
> variation) over serial loops or other alternate connections provide a
> valuable addition and complement to IP only.
>
> Of course a node that cannot speak IP is not very useful, but it is probably
> helpful to be able to distinguish connectivity failures from node failures.

Not to forget that a STONITH implementation may have a deep, overwhelming
desire to try to reset the node with the failed network card. (I never had a
network card go bad under Linux, but I have seen about 5-10 cards go "dead"
because of driver bugs and come alive after a reboot.)

If the STONITH relies on IP connectivity via Ethernet, that would not be
useful.

Running PPP/SLIP over serial rings and running an IP routing protocol to
ensure the connectivity is way more complex and error-prone than the serial
ring code in heartbeat.

> > In my opinion, a program called heartbeat should monitor the neighbors
> > life and report the result, no more. Others will take decisions based
> > on the data reported.
> This seems desirable from a modularity point of view. See below.

I would agree. But I do think that both the cluster messaging services and the
heartbeating share a very common functionality: Namely the message passing
(whether it is heartbeat or a status exchange).

So it may make sense to separate them into different modules, but those are
going to be closely interconnected.

> I know that heartbeat is intended to be modular and is a work in
> progress, but many other packages are starting to depend on specific details
> of heartbeat and its config/action scripts. This not only makes it hard for
> alternatives to heartbeat to arise, it also makes it hard to evolve
> heartbeat itself.

And if you look at the shell scripts and try to improve them, you go blind ;-)

I know that heartbeat was mainly a testbed for exactly that - heartbeating -
and that the resource stuff was added just to have something to make use of
the functionality.

This is why we are talking about separating that functionality quickly (the
API to heartbeat is a very important step and Conectiva is working with Alan
on that) so the resource handling can be "exchanged".

> the development of the cluster manager itself. If so, it might be worth
> considering ruthlessly refactoring it into a separate newborn cluster
> manager component and a well developed heartbeat-only component. This
> may also be a useful exercise to prepare for the eventual arrival of
> Failsafe.

Exactly this has been discussed on this list. You may want to check the
archives, or I can bounce you the mail.

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
why i don't like heartbeat [ In reply to ]
On Thu, May 11, 2000 at 02:29:30PM -0700, Lars Marowsky-Bree wrote:
> On 2000-05-11T14:17:22,
> dgould@suse.com said:

> > valuable addition and complement to IP only.
...
> Running PPP/SLIP over serial rings and running an IP routing protocol to
> ensure the connectivity is way more complex and errorprone than the serial
> ring code in heartbeat.

Agreed. My point was to use IP over ethernet in addition to heartbeating over
serial loops. I suppose one could also use ppp, but once one is using IP
it is not a large distinction.

> I would agree. But I do think that both the cluster messaging services and the
> heartbeating share a very common functionality: Namely the message passing
> (whether it is heartbeat or a status exchange).

Depends on what one thinks about the purpose of cluster messaging and
the size of the cluster. Serial links are easy to overload especially as
the message service finds new uses or the cluster grows larger.

> And if you look at the shell scripts and try to improve them, you go blind ;-)

Well, I didn't want to say that...

> Exactly this has been discussed on this list. You may want to check the
> archives, or I can bounce you the mail.

I may have seen it if it was after the 1st of April, but if you would
like, please bounce me the mail. Also, a lot of things are discussed, but
it is not always clear what the current consensus is.

-dg

--
David Gould dgould@suse.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
Much of the excitement we get out of our work is that we don't really
know what we are doing. -- E. Dijkstra
why i don't like heartbeat [ In reply to ]
dgould@suse.com wrote:
>
> I may regret stirring the controversy here, but I think Mr Etienne
> raises some very valid points that deserve serious consideration.
>
> On Thu, May 11, 2000 at 02:49:24PM -0400, Jerome Etienne wrote:
> > On this mailing list, i critized several time heartbeat without much
> > result, so i send this text. It is a summary for archive purpose.
>
> Thanks for the clear exposition of your views.
>
> > In short, what i don't like in heartbeat:
> >
> > 1. Heartbeat try to design/rewrite a network stack. i claim that IP already
> > does the job. IP has been designed/implemented by experienced people,
>
> I somewhat agree with this, except that the heartbeat protocols (or some
> variation) over serial loops or other alternate connections provide a
> valuable addition and complement to IP only.

This is Best Current Practice according to the Linux-HA HOWTO. Many
commercial cluster solutions use it in addition to ethernet links for
these reasons.

> Of course a node that cannot speak IP is not very useful, but it is probably
> helpful to be able to distinguish connectivity failures from node failures.

And to give cluster reconfiguration directives to failed nodes. You
don't need to shoot a node in the head when it hasn't failed and you can
still communicate with it rationally.

> > In my opinion, a program called heartbeat should monitor the neighbors
> > life and report the result, no more. Others will take decisions based
> > on the data reported.
>
> This seems desirable from a modularity point of view. See below.

This is exactly what it does. It makes no decisions on its own. I do
not consider the shell scripts as part of heartbeat, but an extremely
rudimentary and separate cluster manager. The current code uses
fork/exec as its API. Of course, this is a bit lacking ;-) [But see
below]

Jerome doesn't think the name and the code match. "heartbeat" may be
misnamed. But I would NEVER name it "cluster manager". I might name it
ClusterComm, or something. There are really good rational design
reasons for that which I'm not sure I ever explained, but until Jerome
came around, no one asked ;-)

> > What i like in heartbeat:
> > 1. it exists and so complies to the linus's rule 'show me the sources'
>
> Indeed.
>
> I know that heartbeat is intended to be modular and is a work in
> progress, but many other packages are starting to depend on specific details
> of heartbeat and its config/action scripts. This not only makes it hard for
> alternatives to heartbeat to arise, it also makes it hard to evolve
> heartbeat itself.

I share this concern about the dependencies on details. That's why
we've split out an abstract API [which is currently being implemented].
I did a commit on the first changes for this API yesterday. You can
program using the API without depending on heartbeat's architecture or
structure, or underlying communication mechanisms.

> Also, perhaps heartbeat is too successful. Sometimes I wonder if it is
> providing just good enough cluster manager functionality to actually inhibit
> the development of the cluster manager itself.

I designed it to be capable of providing support for a cluster manager.
I tried hard to do the most you could easily do, but no more. The
simple scripts it uses are the only cluster manager written to go with
it. [See API comments below]

> If so, it might be worth
> considering ruthlessly refactoring it into a separate newborn cluster
> manager component and a well developed heartbeat-only component. This
> may also be a useful exercise to prepare for the eventual arrival of
> Failsafe.

This is exactly what is going on. However, it continues to retain the
basic messaging capability, since it needs it for its own purposes
(that's how it heartbeats the nodes). It provides a higher-level
abstraction for communication with cluster members than sockets do.
This abstraction does not rely on serial ports, nor on ethernet, nor
does it reveal any details of any of these mechanisms to the user. It
just delivers messages - reliably and quickly, in the presence of
transport failures, using available redundant transport mechanisms.
Moreover, it continually monitors these transport mechanisms and reports
on the failure of individual transport mechanisms. This is part of the
reason why I might have better named it "ClusterComm".
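
A minimal sketch of message delivery over redundant transports, in
Python and under stated assumptions: heartbeat's real wire format and
internals are not described in this thread, so the class names, the
(sequence number, payload) message shape and the callable-per-transport
interface below are all hypothetical. The sender pushes every message
down every transport and notes which ones failed; the receiver drops
copies it has already seen.

```python
class RedundantSender:
    """Sends each message over every configured transport (hypothetical interface)."""

    def __init__(self, transports):
        # transports: name -> callable(message); a callable raises OSError on failure
        self.transports = transports
        self.seq = 0

    def send(self, payload):
        """Send on all transports; return (sequence number, failed transport names)."""
        self.seq += 1
        failed = []
        for name, transmit in self.transports.items():
            try:
                transmit((self.seq, payload))
            except OSError:
                failed.append(name)  # report the failure, keep using the others
        return self.seq, failed


class DedupReceiver:
    """Delivers each sequence number once, however many transports carried it."""

    def __init__(self):
        self.seen = set()  # unbounded here; a real receiver would prune old entries

    def receive(self, message):
        seq, payload = message
        if seq in self.seen:
            return None    # duplicate copy from a second transport
        self.seen.add(seq)
        return payload
```

The caller never learns which transport carried a message; it only sees
the payload once, plus a list of transports that failed on the send.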

For example, if you only have one ethernet, and configure it and serial
ports, and unplug the ethernet, no packets are lost, it knows that the
node is still up, yet knows that the ethernet has failed within
seconds. When you plug it back in, it knows that it is working again.
This is a very powerful and interesting property for a highly-available
system.
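
The distinction between a failed link and a failed node can be sketched
as follows, in Python. This is an illustration only, not heartbeat's
implementation: the class name, the link names and the 5 second timeout
are assumptions. Beats are tracked per (node, link) pair; a node counts
as up while at least one link still carries its beats, and each silent
link is reported as failed on its own.

```python
import time


class LinkMonitor:
    """Tracks heartbeats per (node, link) to separate link failure from node failure."""

    def __init__(self, links, dead_after=5.0, clock=time.monotonic):
        self.links = list(links)      # e.g. ["eth0", "ttyS0"] (assumed names)
        self.dead_after = dead_after  # seconds of silence before a link is "failed"
        self.clock = clock            # injectable clock, so the sketch can be tested
        self.last_seen = {}           # (node, link) -> time of the last beat

    def beat(self, node, link):
        """Record a heartbeat from `node` that arrived over `link`."""
        self.last_seen[(node, link)] = self.clock()

    def link_status(self, node):
        """Return {link: True if beats from `node` arrived on it recently}."""
        now = self.clock()
        never = float("-inf")         # a link never heard from counts as failed
        return {
            link: now - self.last_seen.get((node, link), never) <= self.dead_after
            for link in self.links
        }

    def node_up(self, node):
        """A node is up as long as at least one link still carries its beats."""
        return any(self.link_status(node).values())
```

With the ethernet unplugged and the serial port still beating,
link_status() shows the ethernet as failed while node_up() stays true.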

I generally agree with David's comments about it being too successful,
but have a very slightly different view on it. It tried to do
everything which could be done easily, with an eye to being capable of
doing more - as a component of a larger system.

Success causes problems. Generally, I prefer the problems associated
with success to those associated with failure ;-)

It was successful, and generated lots of interest. The result was that
various people have felt the need to extend it. Since I have never
given a clear exposition of the architecture and design philosophy,
various people have begun to add things to it that went beyond its
intent, and would have warped the architecture in the ways Jerome and
David are concerned about.

It is precisely these forces that led me a few weeks ago to decide that
the time had come to make a "real" API. This API will allow its
integration into larger systems, and also allow people to develop the
components they need to get it to do the things that, since it really is
designed to be a single-purpose component, do not fit into its
architecture.

However, this last set of conversations has led me to the conclusion
that it is very much necessary to write up a good exposition of
heartbeat's architecture and design, so that it is clear what it is
designed to do, what it is designed NOT to do, and why it is designed the
way it is. With that background, discussions like this one that Jerome
started can be more readily and appropriately held.

> These are some of my thoughts after reading Mr. Etiennes comments and are not
> meant to restart any argument in particular, just to promote quiet
> contemplation of the alternatives.

I trust that I haven't fanned any flames on this subject either.

-- Alan Robertson
alanr@suse.com
why i don't like heartbeat [ In reply to ]
Jerome,

Thanks for the comments. It *is* good to have them all in the archives
in one place. Since no one had asked before, I hadn't bothered
explaining the more interesting and non-obvious design decisions that
are in it. You've made it clear that there would be significant value
to doing that.

Jerome Etienne wrote:

> Further, even if i dont like his program, alan roberston still
> calmly explains its points of view and i appreciate that.

Thanks! I fear that I haven't always succeeded in being as calm as I
would like but thanks anyway!

> In short, what i don't like in heartbeat:
>
> 1. Heartbeat try to design/rewrite a network stack. i claim that IP already
> does the job. IP has been designed/implemented by experienced people,
> i think it would be a mistake not to use it.
> Moreover to rewrite it will be a HUGE job, and very long to debug.

The protocol code in heartbeat is under 300 lines.

> In my opinion, a program called heartbeat should monitor the neighbors
> life and report the result, no more. Others will take decisions based
> on the data reported.

I completely agree with this comment, but probably not in the way Jerome
intended. Heartbeat may be misnamed. I agree it shouldn't take any
actions on its own. It doesn't. The "cluster manager" that takes
actions is separate from heartbeat, and I have always viewed it as being
completely separate. It uses shell scripts because they were a
prototype cluster manager that I wrote on a weekend before a scheduled
demo.

> IF rewriting a network stack is needed(and not use IP), IF rewriting
> a serial protocol (and not use slip/ppp) is needed, IF a reliable
> multicast protocol is needed (what i question especially for a life
> monitor), IF rewriting one is needed (and not used one of the ten
> existing ones http://www.tascnets.com/mist/doc/mcpCompare.html),
> heartbeat isnt the good place to implement it.

Actually, heartbeat does what it does the way it does specifically
to minimize complexity. There is significant synergy between the
messaging protocol and the heartbeat code. I really do need to explain
this at more length.

> 3. It isnt documented, so except the people who has the time to study
> the code (i tried), nobody will review the custom security (btw it
> is commonly admited that a unreviewed security should be assumed
> untrustable), the custom network protocols etc... If there is
> a mistake in it, the more we wait the harder it will be to fix.
> This is undesirable in general but really harmfull in a HA environment.

The code is commented, but it is becoming apparent that there is a
significant need to document the architecture and design. It looks like
I should take the time to do that.

> What i like in heartbeat:
> 1. it exists and so complies to the linus's rule 'show me the sources'

So, now I get to write up a good explanation of heartbeat. I'm due to
give a quick WIP presentation and run a BOF on heartbeat and HA stuff at
SANE 2000 in about 10 days. Maybe I can have a draft to talk from by
then. I'll probably concentrate first on the protocol aspects, and add
the security aspects after I'm done with these.

Maybe I can make a presentation on this at somewhere really cool to
visit, after it's done ;-)

Thanks!

-- Alan Robertson
alanr@suse.com
why i don't like heartbeat [ In reply to ]
dgould@suse.com wrote:
>
> Also, a lot of things are discussed, but
> it is not always clear what the current concensus is.

Consensus?

What's that ;-)

-- Alan Robertson
alanr@suse.com