Mailing List Archive

Integration with HA SW
I have examined the interface to the heartbeat software and found it quite
rudimentary. When heartbeat needs to activate or deactivate a resource, it
simply issues 'drbdsetup primary' or 'drbdsetup secondary'. No code checks
the actual status of drbd, so in cases that are not mainstream you are
likely to end up in some curious condition. This code works as long as you
are running simple "reset this node", "switch off that node" scenarios.
However, real life is more complex. Several times during my testing I found,
for example, both nodes primary, with one node in state WFConnection and the
other StandAlone.
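
A minimal sketch of the kind of status check that is missing, written in Python purely for illustration. It assumes that each device line in /proc/drbd looks roughly like "0: cs:Connected st:Primary/Secondary ..."; the "cs:" and "st:" field names are an assumption about the output format and may need adjusting for your DRBD version. The function names and the promotion policy are hypothetical, not part of heartbeat or DRBD.

```python
import re

# Assumed shape of a device line: "0: cs:WFConnection st:Primary/Unknown ..."
# The "cs:" / "st:" field names are an assumption; adapt to your DRBD version.
_DEVICE_LINE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+st:([^/\s]+)/(\S+)")


def parse_drbd_status(text):
    """Return {minor: (connection_state, local_role, peer_role)}."""
    devices = {}
    for line in text.splitlines():
        match = _DEVICE_LINE.match(line)
        if match:
            minor, cstate, local, peer = match.groups()
            devices[int(minor)] = (cstate, local, peer)
    return devices


def safe_for_graceful_promote(status, minor):
    """Hypothetical policy: promote only if connected and the peer is not primary."""
    if minor not in status:
        return False
    cstate, _local, peer = status[minor]
    return cstate == "Connected" and peer != "Primary"
```

A resource script could read /proc/drbd, pass its contents to parse_drbd_status(), and refuse to issue the primary command when the check fails, instead of issuing it blindly.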

In our next integration step we would like to integrate DRBD into a
commercial high-availability framework that can handle more complex
situations. Easy operations like graceful failover of DRBD devices should be
possible, and the status of the local resources (like DRBD) should be
monitored.

However, I see some difficulties in controlling DRBD from two nodes, because
the state of one node definitely depends on the state of the other.

Is there any state diagram that shows the different states and transitions
for the strings shown in /proc/drbd?

Is there any recommended procedure for a graceful failover? Should it be
done from both sides, issuing the appropriate 'drbdsetup secondary' and
'drbdsetup primary' commands, or is it better to do everything from the new
active node using the 'secondary_remote' feature?

Does anyone else have experience integrating DRBD with HA software other
than 'heartbeat'?


/Wolfram



=======================================================================
Wolfram Weyer FORCE COMPUTERS GmbH
Staff Engineer - Systems Engineering A Solectron Subsidiary
phone: +49 89 60814-523 Street: Prof.-Messerschmitt-Str. 1
fax: +49 89 60814-112 City: D-85579 Neubiberg/Muenchen
mailto:Wolfram.Weyer@example.com http://www.forcecomputers.com
=======================================================================
Re: Integration with HA SW
* Weyer, Wolfram <Wolfram.Weyer@example.com> [011113 16:08]:
> I have examined the interface to the heartbeat software and found it quite
> rudimentary. In the case heartbeat needs to activate or deactivate a
> resource a simple 'drbdsetup primary' or 'drbdsetup secondary' is issued.
> There is no code that checks about the actual status of drbd, which means
> that in cases that are not mainstream it is most likely that you get in some
> kind of curious condition. This code works as long as you are running simple
> "reset this node", "switch off that node" scenarios. However real life is
> more complex. Several times during my testing I e.g. found both nodes
> primary, one node in state WFConnection the other one as StandAlone.
>
> In our next integration step we would like to integrate DRBD into a
> commercial HighAvailability framework that allows to handle more complex
> situations. Easy operations like graceful failover of DRBD devices should be
> possible and the status of the local resources (like DRBD) should be
> monitored.
>
> However I see some difficulties to control DRBD from two nodes because there
> definitely is some dependency regarding the state.
>
> Is there any state-diagram that shows the different states and transitions
> regarding the strings shown in /proc/drbd?
>

I was thinking about drawing such a diagram for the next DRBD paper, but
did not include it in the end... (I thought that it was not interesting,
just obvious)

> Is there any recommended procedure to do a graceful failover? Should it be
> done from both sides issuing the appropriate 'drbdsetup secondary' and
> 'drbdsetup primary' commands or is it better to do everything from the new
> active node using the 'secondary_remote' feature.

This does not matter. Just avoid having both nodes in primary state.
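
To illustrate the one rule above (never both nodes primary), here is a small Python sketch of the ordering: demote first, verify, then promote. The node objects with role(), demote() and promote() are hypothetical stand-ins for whatever runs 'drbdsetup secondary' and 'drbdsetup primary' on each side, or the 'secondary_remote' feature from one side; nothing here is an actual DRBD or heartbeat interface.

```python
def graceful_failover(old_primary, new_primary):
    """Move the primary role, demoting before promoting.

    Both arguments are hypothetical node handles offering role(), demote()
    and promote(); how they reach the real node is left open.
    """
    if old_primary.role() == "Primary":
        old_primary.demote()
    # Re-check rather than trust the command: refuse to create two primaries.
    if old_primary.role() == "Primary":
        raise RuntimeError("old primary did not demote; refusing to promote")
    new_primary.promote()


class FakeNode:
    """In-memory stand-in used only to exercise the ordering."""

    def __init__(self, name, role, log):
        self.name, self._role, self.log = name, role, log

    def role(self):
        return self._role

    def demote(self):
        self.log.append((self.name, "secondary"))
        self._role = "Secondary"

    def promote(self):
        self.log.append((self.name, "primary"))
        self._role = "Primary"
```

Whether the two steps are driven from both sides or from the new active node only changes how the handles are implemented, not the order of the steps.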

> Has anyone else experiences with integrating DRBD with other HA software
> beside 'heartbeat'?

It seems that there is a drbd module for FailSafe:
http://oss.sgi.com/projects/failsafe/

I have never tried it, but it is an indicator that it is possible to
manage DRBD with a cluster manager other than heartbeat.

BTW: The author(s) of this glue layer never contacted me, so it seems
possible to do such a thing without Philipp-Reisner-insight :)

-Philipp
Re: Integration with HA SW
On 2001-11-14T21:10:24,
Philipp Reisner <philipp.reisner@example.com> said:

> It seems that there is a drbd module for FailSafe:
> http://oss.sgi.com/projects/failsafe/
>
> I have never tried it, but it is an indicator that it is possible to
> manage DRBD with a cluster manager other than heartbeat.
>
> BTW: The author(s) of this glue layer never contacted me, so it seems
> possible to do such a thing without Philipp-Reisner-insight :)

Well, the current FailSafe glue is rather rudimentary and based on what
heartbeat does.

I do have a rather neat idea about how to properly support it and other such
configurations (i.e., replicated databases and so on are very similar). This
involves FailSafe not only controlling the primary, but also the secondary
as an "active" resource, some smart selection of which node to run on, and
active influence from the primary resource on the shadow one. Without any
human interaction *knock on wood*

I do have a gut feeling that this might have some rather unexpected state
transitions, so I will end up drawing a diagram (though on paper ;) next week
I guess.

And yes, I do think this is something you won't easily be able to do with
heartbeat as it is, because its resource manager is way too simplistic still.
The Open Clustering Framework might make it possible to fix that.

Sincerely,
Lars Marowsky-Brée <lmb@example.com>

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
RE: Integration with HA SW
I think I am walking in the same direction (however using the GoAhead
SelfReliant Middleware).

My idea is to skip the whole "drbd start/stop" story and manage the
DRBD status of both nodes from the clustering middleware. I agree that
heartbeat is not suitable for that :-(
I consider a node-status of 'Connected/Primary', 'SynchingAll/Primary',
'QuickSynch/Primary' and 'Connected/Secondary' as "Healthy". Otherwise the
local node resource is "Failed". Within the set of healthy drbd instances
the clustering framework decides which one becomes active.
When only one node is available, we have to consider manual intervention.
As the node is in state 'WFConnection/<Primary|Secondary>', it is marked as
failed and will not become active until the instance is set to "Healthy" by
manual intervention. This is the step where you would normally answer 'yes'
in the drbd startup script to force your upcoming node to be primary.
Currently I don't know how to handle states like 'StandAlone/Primary' or
'Unconnected/Secondary', therefore I am looking for state diagrams as well
;-)
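
The classification described above could be sketched in Python as follows. The state names are spelled as in this message; the strings printed by /proc/drbd may differ. The function names, the operator_override set (standing in for the manual "set to Healthy" step) and the tie-break that prefers a node that is already primary are assumptions for illustration, not part of DRBD or of the SelfReliant middleware.

```python
# (connection state, role) pairs treated as "Healthy" in this message.
HEALTHY = {
    ("Connected", "Primary"),
    ("SynchingAll", "Primary"),
    ("QuickSynch", "Primary"),
    ("Connected", "Secondary"),
}


def classify(cstate, role):
    """Map a node status to "Healthy" or "Failed"."""
    return "Healthy" if (cstate, role) in HEALTHY else "Failed"


def pick_active(nodes, operator_override=frozenset()):
    """Choose the node to activate, or None if no candidate exists.

    nodes maps a node name to (connection state, role). Names listed in
    operator_override model the manual step that marks a node "Healthy".
    """
    candidates = [
        name
        for name, (cstate, role) in sorted(nodes.items())
        if classify(cstate, role) == "Healthy" or name in operator_override
    ]
    # Assumed tie-break: keep the role where it already is, if possible.
    for name in candidates:
        if nodes[name][1] == "Primary":
            return name
    return candidates[0] if candidates else None
```

States such as 'StandAlone/Primary' or 'Unconnected/Secondary' simply fall into "Failed" here, which leaves open the question of how they should really be handled.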

I have seen some problems dealing with states that normally do not occur in
the heartbeat framework (e.g. Unconnected/Secondary). As the GoAhead
framework recognizes node failures and switches over resources within a
second, there is additional potential for problems, because DRBD has to
switch to primary although the current primary was available a few hundred
milliseconds before.

/Wolfram


>>> -----Original Message-----
>>> From: Lars Marowsky-Bree [mailto:lmb@example.com]
>>> Sent: Thursday, 15 November 2001 21:41
>>> To: Philipp Reisner
>>> Cc: Weyer, Wolfram; drbd-devel@example.com
>>> Subject: Re: [DRBD-dev] Integration with HA SW