Mailing List Archive

Implementing the Heartbeat API
Hi Marcelo,

David Brower and others have expressed interest in this, so I've
decided to send this to the linux-ha-dev list, and let others offer
their comments too. Because of David's request, I decided to CC a
couple of FailSafe people too.

I've attached my latest version of the API header file.

Marcelo Tosatti wrote:
>
> Alan,
>
> How do you plan to do the interprocess communication? Signals and FIFO's?
> I'm asking that because i'm interested in starting implementing it.


Actually, I had been planning on implementing it... But I probably
don't have time...

Let's go ahead and discuss it - and decide who will implement it as we
go along...

If you wind up doing it, I'll want to review it very carefully.
This is an item that I would put in blue on the TODO list...

Expect me to be cranky and nit-picky if you do it. I reserve the right
to be unreasonable... ;-)

Let's put out a release with the current fixes in it before we commit
any of these changes to CVS.

Horms: Are you ready for this new release now?

Here's my plan:

Write the requests to the common FIFO /var/run/heartbeat-fifo/

Make a well-known client FIFO directory for clients to make FIFOs in,
with pid == FIFO name... Probably /var/run/heartbeat-clients

The messages in the FIFOs would be the famous "ha_msg" messages... ;-)

Add special message types to handle the queries from clients.

Add a well-known field type maybe "orig_pid" which specifies the PID
of the process making the request (hence the FIFO name)
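
A rough sketch of the per-client FIFO naming described above, not code from heartbeat itself. The directory is the one tentatively named in the plan; client_fifo_path() and create_client_fifo() are hypothetical helper names, and the 0600 mode is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Directory tentatively proposed in the plan; not a settled location. */
#define HB_CLIENT_FIFO_DIR "/var/run/heartbeat-clients"

/*
 * Build "<dir>/<pid>" into buf.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
int client_fifo_path(const char *dir, pid_t pid, char *buf, size_t buflen)
{
	int n = snprintf(buf, buflen, "%s/%ld", dir, (long)pid);

	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/*
 * Create the reply FIFO for the calling process and leave its path in
 * 'path'.  A FIFO left behind by an earlier process with the same pid
 * is reused.  Returns 0 on success, -1 on failure.
 */
int create_client_fifo(const char *dir, char *path, size_t pathlen)
{
	if (client_fifo_path(dir, getpid(), path, pathlen) < 0)
		return -1;
	if (mkfifo(path, 0600) < 0 && errno != EEXIST)
		return -1;
	return 0;
}
```

A client would call create_client_fifo(HB_CLIENT_FIFO_DIR, ...) once at startup and put its pid in the "orig_pid" field of every request, so that heartbeat can rebuild the same path when replying.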

Locally handled requests should probably have some kind of convention
in their types like "lr-" or something... Then you could make
sure they don't accidentally get written to the cluster, and
whine about them in the logs, and return an automatic
response reporting failure for unimplemented requests.

Make sure you handle dead clients or clients whose reply FIFOs might
be full...
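
One possible way to cope with dead or stalled clients, sketched under the assumption that heartbeat opens the reply FIFO non-blocking for each reply. send_to_client() and the result codes are hypothetical names. The caller is assumed to ignore SIGPIPE, and each message is assumed to be no larger than PIPE_BUF so that a write is all-or-nothing.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

enum send_result { SEND_OK, SEND_CLIENT_GONE, SEND_CLIENT_FULL, SEND_ERROR };

/*
 * Try to write one message to a client's reply FIFO without blocking.
 */
enum send_result send_to_client(const char *fifo_path, const void *buf,
                                size_t len)
{
	ssize_t n;
	int saved_errno;
	int fd = open(fifo_path, O_WRONLY | O_NONBLOCK);

	if (fd < 0) {
		/*
		 * ENXIO:  the FIFO exists but no process has it open for reading.
		 * ENOENT: the FIFO (or its directory) is gone.
		 */
		if (errno == ENXIO || errno == ENOENT)
			return SEND_CLIENT_GONE;
		return SEND_ERROR;
	}

	n = write(fd, buf, len);
	saved_errno = errno;
	close(fd);

	if (n == (ssize_t)len)
		return SEND_OK;
	if (n < 0 && saved_errno == EAGAIN)
		return SEND_CLIENT_FULL;	/* reader is alive but not draining */
	if (n < 0 && saved_errno == EPIPE)
		return SEND_CLIENT_GONE;	/* reader closed between open and write */
	return SEND_ERROR;
}
```

On SEND_CLIENT_GONE the FIFO could simply be unlinked and the client forgotten; what to do on SEND_CLIENT_FULL (drop the message, or drop the client after some number of failures) is a policy question this sketch leaves open.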

Replies to messages should have types that match the request, but end
in "-resp"
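
The two naming conventions above ("lr-" for locally handled requests, "-resp" for replies) could be checked with helpers along these lines. This is only a sketch: the function names are hypothetical, the prefix is the "or something" suggested above, and "lr-nodelist" in the usage note is a made-up request type.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOCAL_REQ_PREFIX "lr-"		/* suggested above, not final */
#define REPLY_SUFFIX     "-resp"

/* Nonzero if this message type must never be written to the cluster. */
int is_local_request(const char *type)
{
	return strncmp(type, LOCAL_REQ_PREFIX, strlen(LOCAL_REQ_PREFIX)) == 0;
}

/*
 * Write "<reqtype>-resp" into buf.
 * Returns 0 on success, -1 if the result does not fit.
 */
int reply_type(const char *reqtype, char *buf, size_t buflen)
{
	int n = snprintf(buf, buflen, "%s%s", reqtype, REPLY_SUFFIX);

	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Nonzero if 'type' is exactly the reply type for 'reqtype'. */
int is_reply_to(const char *type, const char *reqtype)
{
	size_t reqlen = strlen(reqtype);

	return strncmp(type, reqtype, reqlen) == 0
	    && strcmp(type + reqlen, REPLY_SUFFIX) == 0;
}
```

For example, reply_type("lr-nodelist", ...) yields "lr-nodelist-resp", and any message for which is_local_request() is true would be answered locally (or rejected and logged) instead of being sent to the other nodes.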

In the case of the list of interfaces, I was planning on the return
message being a comma-separated list of interfaces - that
way all the remote messages will be returned one-for-one
for each request. It should be OK to limit the number of
bytes in the interface names to no more than about 1K bytes
per host... ;-)

Otherwise, we need to implement guaranteed packet delivery order
which I don't want to put in the way of implementing this API.
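
On the client side, a comma-separated interface list like the one proposed above could be walked without modifying the reply. A minimal sketch; next_iface() is a hypothetical helper, and the interface names in the usage note are examples only.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy the next name from a comma-separated list into buf and advance
 * *cursor past it.
 * Returns 1 if a name was copied, 0 at the end of the list, and -1 if
 * the name does not fit in buf (the cursor is left unchanged).
 */
int next_iface(const char **cursor, char *buf, size_t buflen)
{
	const char *p = *cursor;
	size_t len;

	while (*p == ',')		/* skip separators and empty fields */
		p++;
	if (*p == '\0') {
		*cursor = p;
		return 0;
	}
	len = strcspn(p, ",");
	if (len >= buflen)
		return -1;
	memcpy(buf, p, len);
	buf[len] = '\0';
	*cursor = p + len;
	return 1;
}
```

Starting with a cursor at the beginning of "eth0,eth1,ttyS0", three calls return the three names and the fourth returns 0.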

We should have a version of the API in the header which goes
into each request, like this:
#define API_COMM_VERS 1
and then put api_vers (or something) into each request from the
clients. Make it a simple number, not a dotted number so that
we can easily compare less than, greater than, or equal.
Changing the meaning and format, or number of fields for a
request requires upping the number. Adding new requests
doesn't. Unimplemented requests are easily detectable.
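
A sketch of the version comparison this makes possible, assuming the version travels as a plain decimal string in the "api_vers" field. check_api_vers() and the result names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define API_COMM_VERS 1

enum vers_result {
	VERS_MATCH,
	VERS_CLIENT_OLDER,
	VERS_CLIENT_NEWER,
	VERS_MALFORMED
};

/*
 * Compare the api_vers field of a request against our own version.
 * Anything that is not a plain positive integer (for instance a dotted
 * version such as "1.2") is reported as malformed.
 */
enum vers_result check_api_vers(const char *field)
{
	char *end;
	long v;

	if (field == NULL || *field == '\0')
		return VERS_MALFORMED;
	errno = 0;
	v = strtol(field, &end, 10);
	if (errno != 0 || *end != '\0' || v < 1)
		return VERS_MALFORMED;
	if (v < API_COMM_VERS)
		return VERS_CLIENT_OLDER;
	if (v > API_COMM_VERS)
		return VERS_CLIENT_NEWER;
	return VERS_MATCH;
}
```

Because the version is a single integer, "older" and "newer" are each one comparison, which is the point of avoiding dotted numbers.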

I planned on replicating all the messages to all the attached clients,
except for replies that have orig_pid in them (but see the
debugging mode below).

Debugging needs a promiscuous mode so that a process can sit and
monitor the traffic separately from whatever applications are using
the system. It might even be nice to have such a process be able
to make it "really promiscuous", and then see *all*
heartbeats from all machines - including those normally
filtered out.
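
The replication rule and the promiscuous exception could come down to a per-client test like the one below. This is a sketch only: struct client and should_deliver() are hypothetical, the two filter values are copied from the attached header draft, and using 0 to mean "not a reply to a specific pid" is an assumed convention.

```c
#include <assert.h>
#include <sys/types.h>

/* Values as in the attached hb_api.h draft. */
#define LLC_FILTER_DEFAULT 0
#define LLC_FILTER_PMODE   1

struct client {
	pid_t pid;	/* also the name of the client's reply FIFO */
	int   fmode;	/* one of the LLC_FILTER_* values */
};

/*
 * Decide whether a message goes to client 'c'.
 * reply_to is the pid a reply is addressed to, or 0 when the message
 * is ordinary traffic and not a reply to any particular client.
 */
int should_deliver(const struct client *c, pid_t reply_to)
{
	if (reply_to == 0)
		return 1;		/* ordinary traffic: replicate to everyone */
	if (reply_to == c->pid)
		return 1;		/* the reply this client asked for */
	return c->fmode >= LLC_FILTER_PMODE;	/* snoopers see other clients' replies */
}
```

The "really promiscuous" case is not covered here, since it changes which heartbeats are passed up in the first place rather than which client receives them.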

David Brower made the very reasonable request to make this match
corresponding FailSafe APIs. This makes sense, but I haven't looked at
them enough yet to comment. I'm back home now, so I should be able to
do this soon.

I suspect that the big deal will be the communications protocol between
heartbeat and client, not the exact format of the APIs. So, if we have
to tweak them, or implement a failsafe compatibility layer for the APIs,
it should be pretty easy, once the comm stuff is designed and
implemented.

I guess I should get serious about checking the failsafe docs to
minimize the rework, or maybe even get better APIs... ;-)

Comments?


-- Alan Robertson
alanr@suse.com

Attached file: hb_api.h

#include <ha_msg.h>

/*
* Low-level clustering API to heartbeat.
*/

typedef void (*llc_msg_callback_t) (const struct ha_msg* msg
, void* private_data);

typedef void (*llc_nstatus_callback_t) (const char *node, const char * status
, void* private_data);

typedef void (*llc_ifstatus_callback_t) (const char *node
, const char * interface, const char * status
, void* private_data);

struct llc_ops {
/*
*************************************************************************
* Status Update Callbacks
*************************************************************************
*/

/*
* set_msg_callback: Define callback for the given message type
*
* msgtype: Type of message being handled. NULL for default case.
* Note that the default case is not reached for node
* status messages handled by nstatus_callback or
* ifstatus messages handled by ifstatus_callback, as
* well as for types explicitly handled by a
* set_msg_callback handler.
*
* callback: callback function.
*
* p: private data - later passed to callback.
*/
int (*set_msg_callback) (const char * msgtype
, llc_msg_callback_t callback, void * p);

/*
* set_nstatus_callback: Define callback for node status messages
* This is a message of type "st"
*
* cbf: callback function.
*
* p: private data - later passed to callback.
*/

int (*set_nstatus_callback) (llc_nstatus_callback_t cbf
, void * p);
/*
* set_ifstatus_callback: Define callback for interface status messages
* This is a message of type "???"
* These messages are issued whenever an interface goes
* dead or becomes active again.
*
* cbf: callback function.
*
* node: the name of the node to get the interface updates for
* If node is NULL, it will receive notification for all
* nodes.
*
* iface: The name of the interface to receive updates for. If
* iface is NULL, it will receive notification for all
* interfaces.
*
* If NULL is passed for both "node" and "iface", then "cbf" would
* be called for interface status change against any node in
* the cluster.
*
* p: private data - later passed to callback.
*/

int (*set_ifstatus_callback) (llc_ifstatus_callback_t cbf,
const char * node, const char * iface, void * p);


/*
*************************************************************************
* Getting Current Information
*************************************************************************
*/

/*
* init_nodewalk: Initialize walk through list of known nodes
*/
int (*init_nodewalk)(void);
/*
* nextnode: Return next node in the list of known nodes
*/
const char * (*nextnode)(void);
/*
* end_nodewalk: End walk through the list of known nodes
*/
int (*end_nodewalk)(void);
/*
* node_status: Return most recent heartbeat status of the given node
*/
int (*node_status)(const char * nodename);
/*
* init_ifwalk: Initialize walk through list of known interfaces
*/
int (*init_ifwalk)(const char * node);
/*
* nextif: Return next interface in the list of known interfaces on node
*/
const char * (*nextif)(void);
/*
* end_ifwalk: End walk through the list of known interfaces
*/
int (*end_ifwalk)(void);
/*
* if_status: Return current status of the given interface
*/
int (*if_status)(const char * nodename, const char *iface);

/*
*************************************************************************
* Intracluster messaging
*************************************************************************
*/

/*
* sendclustermsg: Send the given message to all cluster members
*/
int (*sendclustermsg)(const struct ha_msg* msg);
/*
* sendnodemsg: Send the given message to the given node in cluster.
*/
int (*sendnodemsg)(const struct ha_msg* msg
, const char * nodename);

/*
* inputfd: Return fd which can be given to select(2) or poll(2)
* for determining when messages are ready to be read.
* Only to be used in select() or poll(), please...
*/
int (*inputfd)(void);
/*
* msgready: Returns TRUE (1) when a message is ready to be read.
*/
int (*msgready)(void);
/*
* setmsgsignal: Associates the given signal with the "message waiting"
* condition.
*/
int (*setmsgsignal)(int signo);
/*
* rcvmsg: Cause the next message to be read - activating callbacks for
* processing the message.
*/
int (*rcvmsg)(int blocking);

/*
* Read the next message without any silly callbacks.
* (at least the next one not intercepted by another callback).
* NOTE: you must dispose of this message by calling ha_msg_del().
*/
struct ha_msg* (*readmsg)(int blocking);

/*
*************************************************************************
* Debugging
*************************************************************************
*
* setfmode: Set filter mode. Analogous to promiscuous mode on a
* network interface.
*
* LLC_FILTER_DEFAULT (default)
* In this mode, all messages destined for this pid
* are received, along with all that don't go to specific pids.
*
* LLC_FILTER_PMODE See all messages, but filter heartbeats
* that don't tell us anything new.
*
* LLC_FILTER_ALLHB See all heartbeats, including those that
* don't change status.
*
* LLC_FILTER_RAW See all packets, from all interfaces, even
* dups. Pkts with auth errors are still ignored.
*
*/
# define LLC_FILTER_DEFAULT 0
# define LLC_FILTER_PMODE 1

/* Do we need these higher levels ? */

# define LLC_FILTER_ALLHB 2
# define LLC_FILTER_RAW 3

struct ha_msg* (*setfmode)(int mode);
};


struct ll_cluster {
void * ll_cluster_private;
struct llc_ops* llc_ops;
};
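
To show how a client might use the node-walk calls in the header above, here is a self-contained sketch. It does not include the real header: struct llc_nodewalk_ops is a cut-down copy of four members of struct llc_ops, the fake_* functions are a stand-in implementation over a fixed table, and the conventions that init_nodewalk() returns 0 on success and that node_status() returns 1 for "up" and 0 for "dead" are assumptions, since the draft does not define them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-down copy of the node-walk members of struct llc_ops. */
struct llc_nodewalk_ops {
	int          (*init_nodewalk)(void);
	const char * (*nextnode)(void);
	int          (*end_nodewalk)(void);
	int          (*node_status)(const char *nodename);
};

/* Assumed status encoding; the header draft leaves it undefined. */
#define NODE_DEAD 0
#define NODE_UP   1

/* ---- stand-in implementation over a fixed table (hypothetical) ---- */
static const char *fake_names[] = { "node1", "node2", "node3" };
static const int   fake_state[] = { NODE_UP, NODE_DEAD, NODE_UP };
#define FAKE_COUNT (sizeof fake_names / sizeof fake_names[0])
static size_t fake_pos;

static int fake_init_nodewalk(void) { fake_pos = 0; return 0; }

static const char *fake_nextnode(void)
{
	return fake_pos < FAKE_COUNT ? fake_names[fake_pos++] : NULL;
}

static int fake_end_nodewalk(void) { return 0; }

static int fake_node_status(const char *nodename)
{
	size_t i;

	for (i = 0; i < FAKE_COUNT; i++)
		if (strcmp(fake_names[i], nodename) == 0)
			return fake_state[i];
	return -1;	/* unknown node */
}

const struct llc_nodewalk_ops fake_ops = {
	fake_init_nodewalk, fake_nextnode, fake_end_nodewalk, fake_node_status
};

/* ---- client code: count the nodes currently in state 'wanted' ---- */
int count_nodes_in_state(const struct llc_nodewalk_ops *ops, int wanted)
{
	const char *node;
	int count = 0;

	if (ops->init_nodewalk() != 0)
		return -1;
	while ((node = ops->nextnode()) != NULL)
		if (ops->node_status(node) == wanted)
			count++;
	ops->end_nodewalk();
	return count;
}
```

With the real API the ops pointer would come from the llc_ops member of struct ll_cluster; the walking loop itself would look the same.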

Implementing the Heartbeat API [ In reply to ]
Alan,

Since we are doing this API, why not remove all handling of
resources and let another program (which can be much more capable than
our way of handling resources) do it?
This could be done with a Mon module using our API.
All the link/service dependency handling is in Mon already. It would also
monitor _services_, not simply whether heartbeat on the other side is
working, as we currently do.

Comments?

On Wed, 26 Apr 2000, Alan Robertson wrote:

> [...]
Implementing the Heartbeat API [ In reply to ]
On Thu, 4 May 2000, Lars Marowsky-Bree wrote:

> On 2000-05-04T11:22:05,
> Marcelo Tosatti <marcelo@conectiva.com.br> said:
>
> > Since we are doing this API, why not remove all handling of
> > resources and let another program (which can be much more capable than
> > our way of handling resources) do it?
>
> To this part I agree.
>
> > This could be done with a Mon module using our API.
>
> This may be possible, but the resource handling is beyond what mon can do.
> Separating it clearly is a good idea, but not with mon, please.

What is the problem with Mon? I haven't tested it enough, but what I've
seen so far (like the services/dependencies scheme) looks good.
Implementing the Heartbeat API [ In reply to ]
On Thu, 4 May 2000, David Brower wrote:

> IMO, dragging this in starts getting too close to the stuff
> that failsafe does. If we're going to do failsafe, we should
> just do failsafe, not re-implement it from the incomplete
> and unintegrated parts in the closet.

I agree with you. But sitting and waiting for FailSafe without seeing one
line of code is quite bad IMHO. There is no architectural documentation at
all. The only interesting thing I've found is the "IRIS FailSafe 2.0
Programmers Guide", which describes its API.
Btw, there are no cluster communication services to be used by the
applications. (DRBD wants something like this, for example).

> What's going on with failsafe? There's been no word for several
> weeks.
I really would like to know. Alan, Simon?

Implementing the Heartbeat API [ In reply to ]
On Thu, 4 May 2000, Lars Marowsky-Bree wrote:

> On 2000-05-04T13:12:43,
> Marcelo Tosatti <marcelo@conectiva.com.br> said:
>
> > I agree with you. But sitting and waiting for FailSafe without seeing one
> > line of code is quite bad IMHO. There is no architectural documentation at
> > all. The only interesting thing i've found is "IRIS FailSafe 2.0
> > Programmers Guide" which describes its API.
> > Btw, there is no cluster communication services to be used by the
> > applications. (DRBD wants something like this, for example).
>
> Both statements are wrong, even if I agree with the work you want to do.
>
> At http://oss.sgi.com/projects/failsafe/doc0.html, you will find some more
> links, including "Functional Specification and Architecture" at
> http://oss.sgi.com/projects/failsafe/docs/spec_arch.html.
Sorry about my mistake, and thank you for the pointers, but that's
still not enough to sit and wait for an undetermined code release IMHO.

> FailSafe is in fact only part of the puzzle, sitting on top of the CHAOS
> ("Clustered High Availability Operating Services"), which has messaging
> services, membership services et cetera.
Will this be open sourced? Are there any documents on these?
>
> > > What's going on with failsafe? There's been no word for several
> > > weeks.
> > I really would like to know. Alan, Simon?
>
> The correct person at SGI to talk to would in fact be Mayank. I will send him
> some mail right away.
Thank you.

>
> The last status I have is that we have started working on the documentation
> scrutinizing. I do have no current date for any release though, except a very
> vague handwaving to be done with the full port (round one) in fall and release
> parts as they are ready.

It would be _very_ interesting if the SGI people could post
the current clean code (clean in terms of GPLizing), even if it does not
work and is not documented. Is that possible?
Implementing the Heartbeat API [ In reply to ]
On 2000-05-04T11:22:05,
Marcelo Tosatti <marcelo@conectiva.com.br> said:

> Since we are doing this API, why not remove all handling of
> resources and let another program (which can be much more capable than
> our way of handling resources) do it?

To this part I agree.

> This could be done with a Mon module using our API.

This may be possible, but the resource handling is beyond what mon can do.
Separating it clearly is a good idea, but not with mon, please.

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Implementing the Heartbeat API [ In reply to ]
On 2000-05-04T11:29:52,
Marcelo Tosatti <marcelo@conectiva.com.br> said:

> > This may be possible, but the resource handling is beyond what mon can do.
> > Separating it clearly is a good idea, but not with mon, please.
> What is the problem with Mon? I've not tested it enough but what i saw
> until now (like the services/dependancies scheme) looks good.

It is way too complex for what we want to do.

And how would you make mon do the resource handling, where to start and stop a
resource et cetera?

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Implementing the Heartbeat API [ In reply to ]
IMO, dragging this in starts getting too close to the stuff
that failsafe does. If we're going to do failsafe, we should
just do failsafe, not re-implement it from the incomplete
and unintegrated parts in the closet.

What's going on with failsafe? There's been no word for
several weeks.

-dB


Marcelo Tosatti wrote:
>
> On Thu, 4 May 2000, Lars Marowsky-Bree wrote:
>
> > On 2000-05-04T11:22:05,
> > Marcelo Tosatti <marcelo@conectiva.com.br> said:
> >
> > > Since we are doing this API, why not remove all handling of
> > > resources and let another program (which can be much more capable than
> > > our way of handling resources) do it?
> >
> > To this part I agree.
> >
> > > This could be done with a Mon module using our API.
> >
> > This may be possible, but the resource handling is beyond what mon can do.
> > Separating it clearly is a good idea, but not with mon, please.
>
> What is the problem with Mon? I've not tested it enough but what i saw
> until now (like the services/dependancies scheme) looks good.
>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.tummy.com
> http://lists.tummy.com/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/

--
Butterflies tell me to say:
"The statements and opinions expressed here are my own and do not necessarily
represent those of Oracle Corporation."
Implementing the Heartbeat API [ In reply to ]
On Thu, May 04, 2000 at 10:35:59AM -0700, David Brower wrote:
> IMO, dragging this in starts getting too close to the stuff
> that failsafe does. If we're going to do failsafe, we should
> just do failsafe, not re-implement it from the incomplete
> and unintegrated parts in the closet.

Agreed, though I would like to be able to take a closer look
at failsafe.

> What's going on with failsafe? There's been no word for
> several weeks.


--
Horms
Implementing the Heartbeat API [ In reply to ]
On 2000-05-04T10:35:59,
David Brower <dbrower@us.oracle.com> said:

> IMO, dragging this in starts getting too close to the stuff
> that failsafe does. If we're going to do failsafe, we should
> just do failsafe, not re-implement it from the incomplete
> and unintegrated parts in the closet.

Separating the layers of heartbeat doesn't interfere with FailSafe at all,
quite the contrary. The "heartbeat" mechanism in heartbeat (yeah, confusing;)
is actually superior to the one used in FailSafe, and it would be useful to
exchange the specific part of FailSafe with the respective heartbeat code.

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Implementing the Heartbeat API [ In reply to ]
On 2000-05-04T13:12:43,
Marcelo Tosatti <marcelo@conectiva.com.br> said:

> I agree with you. But sitting and waiting for FailSafe without seeing one
> line of code is quite bad IMHO. There is no architectural documentation at
> all. The only interesting thing i've found is "IRIS FailSafe 2.0
> Programmers Guide" which describes its API.
> Btw, there is no cluster communication services to be used by the
> applications. (DRBD wants something like this, for example).

Both statements are wrong, even if I agree with the work you want to do.

At http://oss.sgi.com/projects/failsafe/doc0.html, you will find some more
links, including "Functional Specification and Architecture" at
http://oss.sgi.com/projects/failsafe/docs/spec_arch.html.

FailSafe is in fact only part of the puzzle, sitting on top of the CHAOS
("Clustered High Availability Operating Services"), which has messaging
services, membership services et cetera.

> > What's going on with failsafe? There's been no word for several
> > weeks.
> I really would like to know. Alan, Simon?

The correct person at SGI to talk to would in fact be Mayank. I will send him
some mail right away.

The last status I have is that we have started scrutinizing the
documentation. I have no current date for any release, though, except a very
vague hand-waving about being done with the full port (round one) in fall and
releasing parts as they are ready.

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Implementing the Heartbeat API [ In reply to ]
Lars Marowsky-Bree wrote:
>
> On 2000-05-04T10:35:59,
> David Brower <dbrower@us.oracle.com> said:
>
> > IMO, dragging this in starts getting too close to the stuff
> > that failsafe does. If we're going to do failsafe, we should
> > just do failsafe, not re-implement it from the incomplete
> > and unintegrated parts in the closet.
>
> Separating the layers of heartbeat doesn't interfere with FailSafe at all,
> quite the contrary. The "heartbeat" mechanism in heartbeat (yeah, confusing;)
> is actually superior to the one used in FailSafe, and it would be useful to
> exchange the specific part of FailSafe with the respective heartbeat code.
>

I was referring more to the "mon" suggestion. I have no
opinions about the connectivity liveness checking.

-dB
Implementing the Heartbeat API [ In reply to ]
On 2000-05-04T13:35:10,
Marcelo Tosatti <marcelo@conectiva.com.br> said:

> Sorry about my mistake and thank you for the pointers but thats
> still not enough to sit and way for an undetermined code release IMHO.

I agree with you here.

This is why I am encouraging work which will help us right now and _also_ help
us with FailSafe when it becomes available. This is the perfect middle ground.

> This will be opensourced? There are any documents on these?

The CHAOS infrastructure is part of the FailSafe architecture and will be
open sourced too.

> It would be _very_ interesting if the SGI people could post
> the current clean (clean in terms of GPLizing) even if it does not
> work and it is not documented. Is that possible?

You could repeat the question on the FailSafe mailing list. I am not sure if
the SGI people are reading linux-ha-dev. They are the only ones who can say
yes or no to that.

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Implementing the Heartbeat API [ In reply to ]
Marcelo Tosatti wrote:
>
> Alan,
>
> Since we are doing this API, why not remove all handling of
> resources and let another program (which can be much more capable than
> our way of handling resources) do it?
> This could be done with a Mon module using our API.
> All stuff of link/services dependancies is in Mon already, also it would
> monitor _services_, not simply if heartbeat on the other side is working,
> as we currently do.
>
> Comments?


I agree about eventually moving all resource handling to outside of
heartbeat. However, like Lars and David, I don't think Mon is the way
to go for that part.

Let me suggest this plan for the short term:

Implement the API

Write some test apps, and test the basic API functions
which can do some basic things like report node status.

Make a release of heartbeat which has these things in it.

Begin the task of writing a separate cluster/resource
manager/mangler ;-) in "C"

Release a version of heartbeat which has both the old and the
new cluster managers in it

Get some "field" experience with the new cluster manager

Phase out the old resource manager

Remove the old resource manager from heartbeat, and the old
"API" code from heartbeat

This whole process will take a few months.

Does this make sense?

Thanks!

-- Alan Robertson
alanr@suse.com
Implementing the Heartbeat API [ In reply to ]
David Brower wrote:
>
> IMO, dragging this in starts getting too close to the stuff
> that failsafe does. If we're going to do failsafe, we should
> just do failsafe, not re-implement it from the incomplete
> and unintegrated parts in the closet.
>
> What's going on with failsafe? There's been no word for
> several weeks.
>
> -dB


The SGI folks are going at a pretty furious pace making things run
properly under Linux. SGI documenters gave us a DocBook version of the
IRIX documentation. The SuSE folks are looking at overall schedules and
redoing the documentation to match the Linux port. Since FailSafe
depends on STONITH, I'm writing a little "white paper" so to speak on
STONITH, and how it can be implemented. Soon after that, we'll start
implementing a STONITH method for LinuxFailSafe. I'd expect it to be
usable with heartbeat too ;-)


-- Alan Robertson
alanr@suse.com
Implementing the Heartbeat API [ In reply to ]
On 2000-05-05T08:13:20,
Alan Robertson <alanr@suse.com> said:

> Let me suggest this plan for the short term is:
>
> Implement the API
>
> Write some test apps, and test the basic API functions
> which can do some basic things like report node status.
>
> Make a release of heartbeat which has these things in it.
>
> Begin the task of writing a separate cluster/resource
> manager/mangler ;-) in "C"
>
> Release a version of heartbeat which has both the old and the
> new cluster managers in it
>
> Get some "field" experience with the new cluster manager
>
> Phase out the old resource manager
>
> Remove the old resource manager from heartbeat, and the old
> "API" code from heartbeat
>
> This whole process will take a few months.
>
> Does this make sense?

Yes, it makes a lot of sense. But maybe the resource mangler could be done
earlier? I had a colleague get gray hair when he read the current shell
scripts... ;-)

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Implementing the Heartbeat API [ In reply to ]
On Fri, 5 May 2000, Alan Robertson wrote:

> Marcelo Tosatti wrote:
> >
> > Alan,
> >
> > Since we are doing this API, why not remove all handling of
> > resources and let another program (which can be much more capable than
> > our way of handling resources) do it?
> > This could be done with a Mon module using our API.
> > All stuff of link/services dependancies is in Mon already, also it would
> > monitor _services_, not simply if heartbeat on the other side is working,
> > as we currently do.
> >
> > Comments?
>
>
> I agree about eventually moving all resource handling to outside of
> heartbeat. However, like Lars and David, I don't think Mon is the way
> to go for that part.
>
> Let me suggest this plan for the short term is:
>
> Implement the API
>
> Write some test apps, and test the basic API functions
> which can do some basic things like report node status.
>
> Make a release of heartbeat which has these things in it.
>
> Begin the task of writing a separate cluster/resource
> manager/mangler ;-) in "C"
>
> Release a version of heartbeat which has both the old and the
> new cluster managers in it
>
> Get some "field" experience with the new cluster manager
>
> Phase out the old resource manager
>
> Remove the old resource manager from heartbeat, and the old
> "API" code from heartbeat
>
> This whole process will take a few months.
>
> Does this make sense?
Yes
Implementing the Heartbeat API [ In reply to ]
And something else to think about.. Encapsulation of system specific stuff
(will make cross-platform portability much easier - i.e. no use of /proc on
GNU/Linux, etc. - Could make a generic "it works" solution and add system
specific fixes).

Matt Soffen
Web Intranet Developer
http://www.iso-ne.com/
==============================================
Boss - "My boss says we need some eunuch programmers."
Dilbert - "I think he means UNIX and I already know UNIX."
Boss - "Well, if the company nurse comes by, tell her I said
never mind."
- Dilbert -
==============================================


> -----Original Message-----
> From: Lars Marowsky-Bree [SMTP:lmb@suse.de]
> Sent: Friday, May 05, 2000 10:37 AM
> To: linux-ha-dev@lists.tummy.com
> Subject: Re: [Linux-ha-dev] Implementing the Heartbeat API
>
> On 2000-05-05T08:13:20,
> Alan Robertson <alanr@suse.com> said:
>
> > Let me suggest this plan for the short term is:
> >
> > Implement the API
> >
> > Write some test apps, and test the basic API functions
> > which can do some basic things like report node status.
> >
> > Make a release of heartbeat which has these things in it.
> >
> > Begin the task of writing a separate cluster/resource
> > manager/mangler ;-) in "C"
> >
> > Release a version of heartbeat which has both the old and the
> > new cluster managers in it
> >
> > Get some "field" experience with the new cluster manager
> >
> > Phase out the old resource manager
> >
> > Remove the old resource manager from heartbeat, and the old
> > "API" code from heartbeat
> >
> > This whole process will take a few months.
> >
> > Does this make sense?
>
> Yes, it makes a lot of sense. But maybe the resource mangler could be done
> earlier? I had a colleague get gray hair when he read the current shell
> scripts... ;-)
>
> Sincerely,
> Lars Marowsky-Bree <lmb@suse.de>
> Development HA
>
> --
> Perfection is our goal, excellence will be tolerated. -- J. Yahl
Implementing the Heartbeat API [ In reply to ]
Hi Matt,

"Soffen, Matthew" wrote:
>
> And something else to think about.. Encapsulation of system specific stuff
> (will make cross-platform portability much easier - i.e. no use of /proc on
> GNU/Linux, etc. - Could make a generic "it works" solution and add system
> specific fixes).

I think, in general, that this is the idea. But, it's good to remind us
;-)

Thanks Matt!

-- Alan Robertson
alanr@suse.com
Implementing the Heartbeat API [ In reply to ]
Beating on a drum never hurts anything (unless it's YOU who is the drum).

*g*

Matt Soffen
Web Intranet Developer
http://www.iso-ne.com/
==============================================
Boss - "My boss says we need some eunuch programmers."
Dilbert - "I think he means UNIX and I already know UNIX."
Boss - "Well, if the company nurse comes by, tell her I said
never mind."
- Dilbert -
==============================================


> -----Original Message-----
> From: Alan Robertson [SMTP:alanr@suse.com]
> Sent: Friday, May 05, 2000 2:50 PM
> To: linux-ha-dev@lists.tummy.com
> Subject: Re: [Linux-ha-dev] Implementing the Heartbeat API
>
> Hi Matt,
>
> "Soffen, Matthew" wrote:
> >
> > And something else to think about.. Encapsulation of system specific stuff
> > (will make cross-platform portability much easier - i.e. no use of /proc on
> > GNU/Linux, etc. - Could make a generic "it works" solution and add system
> > specific fixes).
>
> I think, in general, that this is the idea. But, it's good to remind us
> ;-)
>
> Thanks Matt!
>
> -- Alan Robertson
> alanr@suse.com