Mailing List Archive

The nice nice_failback :)

Hello!

I think (I *hope*, to be honest) this is the nicest nice_failback
patch I've ever done. It adds some features that I'll extend next Monday,
like periodic resources_held messages and so on.
I'd challenge the brave ones to test this code (I'm still hard-testing
it). If someone survives and gives me some feedback, I'll put it in
CVS on Monday :)

Have a nice weekend!

Luis

[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]

[Attachment: actual_stuff.patch -- Patch against heartbeat CVS of
04.13.2000, sent base64-encoded. The decoded patch appears inline in
David Gould's review later in this thread.]
The nice nice_failback :) [ In reply to ]
"Luis Claudio R. Goncalves" wrote:
>
> Hello!
>
> I think (I *hope*, to be honest) this is the nicest nice_failback
> patch I've ever done. It adds some features that I'll extend next Monday,
> like periodic resources_held messages and so on.
> I'd challenge the brave ones to test this code (I'm still hard-testing
> it). If someone survives and gives me some feedback, I'll put it in
> CVS on Monday :)

Sorry I was out of commission for a few days, so this is coming late,
and sounds like a broken record. I am generally opposed to putting
support of resources into heartbeat, particularly if they restrict
things to only two machines.

It must be possible (and not too awful) to run without any resource
management at all, and to run on more than 2 nodes. These are
essential. Don't shortcut them.


-- Alan Robertson
alanr@suse.com
The nice nice_failback :) [ In reply to ]
Horms wrote:
>
> On Mon, Apr 17, 2000 at 09:39:47AM -0600, Alan Robertson wrote:
> > "Luis Claudio R. Goncalves" wrote:
> > >
> > > Hello!
> > >
> > > I think (I *hope*, to be honest) this is the nicest nice_failback
> > > patch I've ever done. It adds some features that I'll extend next Monday,
> > > like periodic resources_held messages and so on.
> > > I'd challenge the brave ones to test this code (I'm still hard-testing
> > > it). If someone survives and gives me some feedback, I'll put it in
> > > CVS on Monday :)
> >
> > Sorry I was out of commission for a few days, so this is coming late,
> > and sounds like a broken record. I am generally opposed to putting
> > support of resources into heartbeat, particularly if they restrict
> > things to only two machines.
>
> The problem is that heartbeat as it stands has a _serious_ flaw.

Agreed.

> If all links fail then resources become owned by more than one
> machine and will not be relinquished once links are re-established.

Yes, but a better way might be to have pseudo-quorum based on the
reachability of something like a router or switch or hub. And do
something I'll outline below, ALSO.

> At the moment the hack is to have as many links used for heartbeat
> communication as possible and hope that you never run into a situation
> where nodes lose communication with each other and yet are fully
> functional. This is in my opinion an acceptable situation in the short
> term, as in the case of 2 nodes a serial link should give you a
> fair amount of security against all links failing.

I don't think I'd call it a "hack", and would recommend it even if it
didn't help solve the problem. But, this is not to fundamentally
disagree with your assessment.

> To my mind, to get around this problem the best way forward is to have
> nodes keep track of resources internally; when nodes change state they can
> check to see whether a resource is - or could potentially be - owned by
> any other nodes on the network. Without this, assumptions have to be made
> that a node being accessible means that given resources are accessible.
> Especially in the case where there is no master for a resource, there
> is no way to make such assumptions without the possibility of situations
> where either resources are duplicated or disappear off the network.

With drbd (for example), you MUST NEVER have the mirror mounted
read-write on both sides simultaneously, so your solution is insufficient for
this. The only way I know to handle this is to follow Stephen's
suggestion of having a pseudo-quorum resource that you have to "own" in
order to own the master side of the mirror. It should work like this:
If you can reach the hub and you can't reach the master, then you may
take over the drbd resource.
If you can't reach the hub, then you should probably shut down, and
await its becoming available again.
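
In rough code, the rule looks something like this (just a sketch; the
ping_node() test and the function names are made up for illustration):

        /* Pseudo-quorum takeover rule, as described above. */
        if (ping_node(hub)) {
                if (!ping_node(master)) {
                        /* We can reach the hub but not the master:
                         * we may take over the drbd resource. */
                        take_over_drbd_resource();
                }
        } else {
                /* Can't reach the hub: no quorum.  Shut down and
                 * wait for the hub to come back. */
                shut_down_and_wait();
        }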

This will fail to work in the following very unlikely situation:
Both sides can reach the hub/switch/router,
Neither side can talk to the other (including via alternate paths)

[This is at least a double failure]

This also solves another important problem:
A side staying up when it can't serve its customers.

Pardon me if I've forgotten, but does this solve the same problems as
you're trying to solve?

> > It must be possible (and not too awful) to run without any resource
> > management at all, and to run on more than 2 nodes. These are
> > essential. Don't shortcut them.
>
> I agree with this but resource management is important. It should be
> possible for resource management to be inactive if there are no resources.

That's sufficient [as long as it's not too horribly ugly].

-- Alan Robertson
alanr@suse.com
The nice nice_failback :) [ In reply to ]
On Mon, 17 Apr 2000, Alan Robertson wrote:

> Horms wrote:
> >
> > On Mon, Apr 17, 2000 at 09:39:47AM -0600, Alan Robertson wrote:
> > > "Luis Claudio R. Goncalves" wrote:
> > > >
> > > > Hello!
> > > >
> > > > I think (I *hope*, to be honest) this is the nicest nice_failback
> > > > patch I've ever done. It adds some features that I'll extend next Monday,
> > > > like periodic resources_held messages and so on.
> > > > I'd challenge the brave ones to test this code (I'm still hard-testing
> > > > it). If someone survives and gives me some feedback, I'll put it in
> > > > CVS on Monday :)
> > >
> > > Sorry I was out of commission for a few days, so this is coming late,
> > > and sounds like a broken record. I am generally opposed to putting
> > > support of resources into heartbeat, particularly if they restrict
> > > things to only two machines.
> >
> > The problem is that heartbeat as it stands has a _serious_ flaw.
>
> Agreed.
>
> > If all links fail then resources become owned by more than one
> > machine and will not be relinquished once links are re-established.
>
> Yes, but a better way might be to have pseudo-quorum based on the
> reachability of something like a router or switch or hub. And do
> something I'll outline below, ALSO.
That was exactly our idea when we started to help with heartbeat.
The "FIXME: do something useful here" on my patch is there because we
need scripts to "do" the pseudo-quorum. Luis already explained this
in a past message.
(http://lists.tummy.com/pipermail/linux-ha-dev/2000-March/000460.html)


> > At the moment the hack is to have as many links used for heartbeat
> > communication as possible and hope that you never run into a situation
> > where nodes lose communication with each other and yet are fully
> > functional. This is in my opinion an acceptable situation in the short
> > term, as in the case of 2 nodes a serial link should give you a
> > fair amount of security against all links failing.
>
> I don't think I'd call it a "hack", and would recommend it even if it
> didn't help solve the problem. But, this is not to fundamentally
> disagree with your assessment.
>
> > To my mind, to get around this problem the best way forward is to have
> > nodes keep track of resources internally; when nodes change state they can
> > check to see whether a resource is - or could potentially be - owned by
> > any other nodes on the network. Without this, assumptions have to be made
> > that a node being accessible means that given resources are accessible.
> > Especially in the case where there is no master for a resource, there
> > is no way to make such assumptions without the possibility of situations
> > where either resources are duplicated or disappear off the network.
>
> With drbd (for example), you MUST NEVER have the mirror mounted
> read-write on both sides simultaneously, so your solution is insufficient for
> this. The only way I know to handle this is to follow Stephen's
> suggestion of having a pseudo-quorum resource that you have to "own" in
> order to own the master side of the mirror. It should work like this:
> If you can reach the hub and you can't reach the master, then you may
> take over the drbd resource.
> If you can't reach the hub, then you should probably shut down, and
> await its becoming available again.
>
> This will fail to work in the following very unlikely situation:
> Both sides can reach the hub/switch/router,
> Neither side can talk to the other (including via alternate paths)
Then both sit_and_cry().
> [This is at least a double failure]
>
> This also solves another important problem:
> A side staying up when it can't serve its customers.
>
> Pardon me if I've forgotten, but does this solve the same problems as
> you're trying to solve?
Horms?
>
> > > It must be possible (and not too awful) to run without any resource
> > > management at all, and to run on more than 2 nodes. These are
> > > essential. Don't shortcut them.
> >
> > I agree with this but resource management is important. It should be
> > possible for resource management to be inactive if there are no resources.
The nice nice_failback :) [ In reply to ]
Howdy again!

On Mon, 17 Apr 2000, Alan Robertson wrote:
> Sorry I was out of commission for a few days, so this is coming late,
> and sounds like a broken record. I am generally opposed to putting
> support of resources into heartbeat, particularly if they restrict
> things to only two machines.

It's not support for resource management or anything like that... it's
only a beautiful trick(tm) to solve a race. In my previous message I
explained why I restricted the startup protocol to the two-host
scenario. Anyway, it's still possible to use N nodes and have
nice_failback on or off...

> It must be possible (and not too awful) to run without any resource
> management at all, and to run on more than 2 nodes. These are
> essential. Don't shortcut them.

Ok, ok, you're the boss :P

Luis

[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]
The nice nice_failback :) [ In reply to ]
On Mon, Apr 17, 2000 at 09:39:47AM -0600, Alan Robertson wrote:
> "Luis Claudio R. Goncalves" wrote:
> >
> > Hello!
> >
> > I think (I *hope*, to be honest) this is the nicest nice_failback
> > patch I've ever done. It adds some features that I'll extend next Monday,
> > like periodic resources_held messages and so on.
> > I'd challenge the brave ones to test this code (I'm still hard-testing
> > it). If someone survives and gives me some feedback, I'll put it in
> > CVS on Monday :)
>
> Sorry I was out of commission for a few days, so this is coming late,
> and sounds like a broken record. I am generally opposed to putting
> support of resources into heartbeat, particularly if they restrict
> things to only two machines.

The problem is that heartbeat as it stands has a _serious_ flaw.
If all links fail then resources become owned by more than one
machine and will not be relinquished once links are re-established.
At the moment the hack is to have as many links used for heartbeat
communication as possible and hope that you never run into a situation
where nodes lose communication with each other and yet are fully
functional. This is in my opinion an acceptable situation in the short
term, as in the case of 2 nodes a serial link should give you a
fair amount of security against all links failing.

To my mind, to get around this problem the best way forward is to have
nodes keep track of resources internally; when nodes change state they can
check to see whether a resource is - or could potentially be - owned by
any other nodes on the network. Without this, assumptions have to be made
that a node being accessible means that given resources are accessible.
Especially in the case where there is no master for a resource, there
is no way to make such assumptions without the possibility of situations
where either resources are duplicated or disappear off the network.
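
As a very rough sketch of the sort of internal tracking I mean (the
struct and function names here are invented for illustration, not a
proposed interface):

        #include <string.h>

        /* One entry per resource, as seen by this node. */
        struct rsc_entry {
                const char *name;       /* resource identifier */
                const char *owner;      /* nodename believed to hold it,
                                         * or NULL if unowned */
        };

        /* On a state change: may we take resource rsc, or does some
         * other node already claim to own it? */
        int may_take_resource(struct rsc_entry *tab, int n,
                        const char *rsc, const char *me)
        {
                int i;
                for (i = 0; i < n; i++) {
                        if (strcmp(tab[i].name, rsc) == 0
                        &&  tab[i].owner != NULL
                        &&  strcmp(tab[i].owner, me) != 0) {
                                return 0;       /* owned elsewhere */
                        }
                }
                return 1;
        }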


> It must be possible (and not too awful) to run without any resource
> management at all, and to run on more than 2 nodes. These are
> essential. Don't shortcut them.

I agree with this but resource management is important. It should be
possible for resource management to be inactive if there are no resources.


--
Horms
The nice nice_failback :) [ In reply to ]
Hi Luis,

I am not up enough on the current state of the rest of this code to
comment on correctness, but I do have a few comments about style and
thought they might help you. The comments are sprinkled about
your code below.

-dg

On Fri, Apr 14, 2000 at 07:10:02PM -0300, Luis Claudio R. Goncalves wrote:
> Hello!
>
> patch I've ever done. It adds some features that I'll extend next Monday,
> like periodic resources_held messages and so on.
> I'd challenge the brave ones to test this code (I'm still hard-testing
> it). If someone survives and gives me some feedback, I'll put it in
> CVS on Monday :)
>
> Have a nice weekend!
>
> Luis
>
> [ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
> [. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
> [. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
> [. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]

Content-Description: Patch against heartbeat CVS of 04.13.2000
> diff -ruN /home/lclaudio/linux-ha/heartbeat/ha_msg.h linux-ha/heartbeat/ha_msg.h
> --- /home/lclaudio/linux-ha/heartbeat/ha_msg.h Wed Apr 12 20:03:49 2000
> +++ linux-ha/heartbeat/ha_msg.h Thu Apr 13 16:10:07 2000
> @@ -36,14 +36,15 @@
> #define F_AUTH "auth" /* Authentication string */
> #define F_FIRSTSEQ "firstseq" /* Lowest seq # to retransmit */
> #define F_LASTSEQ "lastseq" /* Highest seq # to retransmit */
> -
> +#define F_RES "resource" /* Resources held by the node */
> +#define F_NRES "no_of_res" /* Number of Resources */

Why not follow the example of the other defines that use the same name
as the string, and use:
#define F_RESOURCE "resource"
#define F_NO_OF_RES "no_of_res" /* or even F_NO_OF_RESOURCES ? */


> #define T_STATUS "status" /* Message type = Status */
> #define NOSEQ_PREFIX "NS_" /* Give no sequence number */
> #define T_REXMIT "NS_rexmit" /* Message type = Retransmit request */
> #define T_NAKREXMIT "NS_nak_rexmit" /* Message type = NAK Re-xmit rqst */
> #define T_STARTING "starting" /* Message type = Starting Heartbeat */
> -
> +#define T_RES "rsc_held" /* Message type = List of Resources */
>
> /* Allocate new (empty) message */
> struct ha_msg * ha_msg_new(int nfields);
> diff -ruN /home/lclaudio/linux-ha/heartbeat/heartbeat.c linux-ha/heartbeat/heartbeat.c
> --- /home/lclaudio/linux-ha/heartbeat/heartbeat.c Wed Apr 12 20:03:49 2000
> +++ linux-ha/heartbeat/heartbeat.c Fri Apr 14 18:46:58 2000
> @@ -164,6 +164,16 @@
> #define DROPIT 1
> #define DUPLICATE 2
>
> +/* Defines to use on CLUSTER flags */
> +#define ME_ALV 16 /* I am alive */
> +#define ME_PRI 32 /* I am the primary node */
> +#define ME_RSC 64 /* I have the resources */
> +#define ME_STR 128 /* I am starting */
> +#define OT_ALV 1 /* The other node is alive */
> +#define OT_PRI 2 /* The other node is the primary */
> +#define OT_RSC 4 /* The other node has the resources */
> +#define OT_STR 8 /* The other node is starting */
> +


I would like more descriptive names. Think of the guy trying to debug
this in the middle of the night ...

In general you seem to want to save on both bits and on typing. In HA code
correctness is the main concern, and in my experience correctness is
most easily achieved by focussing on clarity above all other goals.
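
For instance, something along these lines (the exact spellings are just
an illustration, not a demand):

        /* Flags for the cluster state word */
        #define WE_ARE_ALIVE    0x10    /* I am alive */
        #define WE_ARE_PRIMARY  0x20    /* I am the primary node */
        #define WE_HOLD_RSC     0x40    /* I have the resources */
        #define WE_ARE_STARTING 0x80    /* I am starting */
        #define OTHER_ALIVE     0x01    /* The other node is alive */
        #define OTHER_PRIMARY   0x02    /* The other node is the primary */
        #define OTHER_HOLDS_RSC 0x04    /* The other node has the resources */
        #define OTHER_STARTING  0x08    /* The other node is starting */

reads at a glance, and the hex constants make it obvious that these are
bit flags meant to be tested with "&".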

> int verbose = 0;
>
> const char * cmdname = "heartbeat";
> @@ -176,7 +186,8 @@
> int we_are_primary = 0;
> int send_starting_now = 1;
> int nice_failback = 0;
> -int starting = 1;
> +char CLUSTER = 0;
> +char FAULTS = 0;

I really object to the uppercase variable names. In most places uppercase
signifies that the name is a macro of some kind, so it is somewhat
misleading to introduce variables like "CLUSTER".

> int killrunninghb = 0;
> int rpt_hb_status = 0;
> int childpid = -1;
> @@ -192,6 +203,7 @@
> extern const int num_hb_media_types;
> int nummedia = 0;
> int status_pipe[2]; /* The Master status pipe */
> +struct ha_msg * resources_held = NULL;
>
> const char *ha_log_priority[8] = {
> "emerg",
> @@ -801,8 +813,7 @@
> struct ha_msg * msg = NULL;
> int resources_requested_yet = 0;
> time_t lastnow = 0L;
> - int received_starting = 0;
> - char iface[MAXIFACELEN];
> + char iface[MAXIFACELEN];
> struct link *lnk;
>
> init_status_alarm();
> @@ -810,6 +821,8 @@
>
> clearerr(f);
>
> + CLUSTER |= (ME_ALV + ME_STR);
> +
> for (;; (msg != NULL) && (ha_msg_del(msg),msg=NULL, 1)) {
> time_t msgtime;
> time_t now = time(NULL);
> @@ -822,7 +835,7 @@
> send_local_status();
> }
>
> - if ((send_starting_now && nice_failback) && starting) {
> + if (send_starting_now && (CLUSTER && ME_STR)) {
BUG? ^^ bug?
Perhaps you meant "CLUSTER & ME_STR"?

> send_starting_now = 0;
> ha_log(LOG_DEBUG, "Sending starting msg");
> send_local_starting();
> @@ -906,26 +919,57 @@
> }
> }
>
> -
> -
> + if ( thisnode != curnode ) {
> + /* the other host is alive */
> + CLUSTER |= OT_ALV;
> + }
> +
> /* If we're starting and a "starting" message came from another
> * node, the primary may take its role. Else act as secondary
> * (of course, if nice_failback is on)
> */
> -
> if (!strcasecmp(type,NOSEQ_PREFIX T_STARTING)
> - && thisnode != curnode && (starting && nice_failback)) {
> + && (thisnode != curnode) && nice_failback) {

Why does it not matter if you are starting anymore?

> nice_failback = 0;
> + CLUSTER |= (OT_ALV|OT_STR);

Here is where the names start to make a difference: I now find I have to
skip back to a header file to find out what this might mean, as this is
not as expressive as something like
cluster_state |= (OTHER_ALIVE | OTHER_STARTING);
would have been.

> cluster_already_active = 0;
> - received_starting = 1;
> - starting = 0;
> ha_log(LOG_DEBUG,"Received starting msg from %s"
> ,from);
> - send_local_starting();
> + if (CLUSTER && ME_STR) {
BUG? ^^ bug?
Perhaps you meant "CLUSTER & ME_STR"?

> + ha_log(LOG_DEBUG,
> + "Everybody is starting now...");
> + send_local_starting();
> + }
> + /* continue */
> + }
> +
> + if (!strcasecmp(type,NOSEQ_PREFIX T_STARTING)
> + && (thisnode != curnode) && !(CLUSTER && ME_STR)) {
BUG? ^^ bug?
Perhaps you meant "CLUSTER & ME_STR"?

> + send_resources_held();
> + if (atoi(ha_msg_value(resources_held, F_NRES))< 1) {
> + ha_log(LOG_DEBUG,"The other guy is starting"
> + "and I have no resolrces...");
^^^^^^^^^
spelling

> + nice_failback = 0;
> + ha_log(LOG_DEBUG,"May the primary takes place");
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please rephrase, I can't tell what this means.

> + req_our_resources();
> + }
> continue;
> }
> -
> - /*
> +
> + if (!strcasecmp(type,NOSEQ_PREFIX T_RES)
> + && (thisnode != curnode)) {
> + ha_log(LOG_INFO, "%s has %s resources!",
> + from, ha_msg_value(msg, F_NRES));
> + if (atoi(ha_msg_value(msg, F_NRES))> 0) {
> + /* the other node has the resources */
> + CLUSTER |= OT_RSC;
> + } else {
> + CLUSTER &= !(OT_RSC);
BUG? ^ bug?
Perhaps you meant "CLUSTER &= ~OT_RSC"?

> + }
> + continue;
> + }
> +
> + /*
> * Request our resources after a (PPP-induced) delay.
> * If we have PPP as our only link this delay might have
> * to be 7 or 8 seconds. Otherwise the needed delay is
> @@ -936,20 +980,38 @@
> */
>
> if (!WeAreRestarting && !resources_requested_yet
> - && (thisnode != curnode && (now-starttime) > RQSTDELAY)) {
> - if (nice_failback && !received_starting) {
> + && ((thisnode != curnode) && (now-starttime) > RQSTDELAY)) {
> +
> + CLUSTER &= !(ME_STR);
BUG? ^ bug?
Perhaps you meant "^", not "!"?


> + if (nice_failback && !(CLUSTER && OT_STR)) {
BUG? ^^ bug?
Perhaps you meant "&", not "&&"?

> ha_log(LOG_DEBUG,
> "The cluster is already active");
> cluster_already_active = 1;
> +
> + if (CLUSTER && OT_RSC) {
BUG? ^^ bug?
Perhaps you meant "&", not "&&"?

> + ha_log(LOG_DEBUG,
> + "The other node has the resources");
> + } else {
> + ha_log(LOG_DEBUG,
> + "But noone holds the resources...");
> + /* Do something inteligent */
> + nice_failback = 0;
> + ha_log(LOG_DEBUG,
> + "May the primary takes place");
> + }
> +
> } else {
> - if (nice_failback && received_starting) {
> + if (nice_failback && (CLUSTER && OT_STR)) {
BUG? ^^ bug?
Perhaps you meant "&", not "&&"?

> ha_log(LOG_DEBUG,
> "Everybody is starting now");
> + nice_failback=0;
> }
> }
> resources_requested_yet=1;
> - starting = 0;
> req_our_resources();
> + if (CLUSTER && OT_STR) {
BUG? ^^ bug?
Perhaps you meant "&", not "&&"?

> + CLUSTER &= !OT_STR;
BUG? ^ bug?
Perhaps you meant "^", not "!"?

> + }
> }
>
> if (!strcasecmp(type,NOSEQ_PREFIX T_STARTING)) {
> @@ -1501,6 +1563,75 @@
> return(HA_OK);
> }
>
> +
> +/* Send resources_held list out to the cluster */
> +int
> +send_resources_held(void)
> +{
> + struct ha_msg * m;
> + int rc;
> + char timestamp[16];
> + const char * nrh;
> +
> + sprintf(timestamp, "%lx", time(NULL));
> +
> + /* if (debug){ */
> + ha_log(LOG_DEBUG, "Sending resources held list msg");
> + /* } */
> + if ((m=ha_msg_new(0)) == NULL) {
> + ha_log(LOG_ERR, "Cannot send resources held list msg");
> + return(HA_FAIL);
> + }
> +
> + nrh = ha_msg_value(resources_held, F_NRES);
> + ha_log(LOG_DEBUG, "Number of resources held: %s", nrh);
> +
> + if ((ha_msg_add(m, F_TYPE, NOSEQ_PREFIX T_RES) == HA_FAIL)
> + || (ha_msg_add(m, F_ORIG, curnode->nodename) == HA_FAIL)
> + || (ha_msg_add(m, F_TIME, timestamp) == HA_FAIL)
> + || (ha_msg_add(m, F_NRES, nrh) == HA_FAIL)) {
> + ha_log(LOG_ERR, "send_resources_held: "
> + "Cannot create resources held list msg");
> + rc = HA_FAIL;
> + }
> +
> + ha_log(LOG_DEBUG, "Number of resources for the msg: %s",
> + ha_msg_value(m, F_NRES));

I think this is redundant with the ha_log(LOG_DEBUG...) about 10 lines up.

> +
> + /* If the message header is OK, let's look for the resource list
> + * and send it out */
> +
> + if (rc != HA_FAIL) {
> + int j;
> + if (!resources_held || !resources_held->names
> + || !resources_held->values) {
> + ha_log(LOG_DEBUG,
> + "send_resources_held: oops, no resources held");

Would it be better to check all the prerequisites before building the
message?

> + } else {
> + for (j=0; j < resources_held->nfields; ++j) {
> + if (strcmp(F_RES,
> + resources_held->names[j]) == 0) {
I do not understand this test at all. What are you trying to do here?

> + ha_log(LOG_DEBUG, "resource name: %s",
> + resources_held->values[j]);
> + if (ha_msg_add(m, F_RES,
> + resources_held->values[j])
> + == HA_FAIL) {
> + rc=HA_FAIL;
> + }
> + }
> + }
> + }
> + }
> +
> + if (rc != HA_FAIL) {
> + rc = send_cluster_msg(m);
> + }
> +
> + ha_msg_del(m);
> + return(rc);
> +}
> +
> +
> /* Send the starting msg out to the cluster */
> int
> send_local_starting(void)
> @@ -1519,8 +1650,8 @@
> return(HA_FAIL);
> }
> if ((ha_msg_add(m, F_TYPE, NOSEQ_PREFIX T_STARTING) == HA_FAIL)
> - && (ha_msg_add(m, F_ORIG, curnode->nodename) == HA_FAIL)
> - && (ha_msg_add(m, F_TIME, timestamp) == HA_FAIL)) {
> + || (ha_msg_add(m, F_ORIG, curnode->nodename) == HA_FAIL)
> + || (ha_msg_add(m, F_TIME, timestamp) == HA_FAIL)) {

Nice catch! I think. (With "&&", a successful first ha_msg_add() would
short-circuit the rest, so F_ORIG and F_TIME were never added at all;
with "||" every add runs and any single failure is caught.)

> ha_log(LOG_ERR, "send_local_starting: "
> "Cannot create local starting msg");
> rc = HA_FAIL;
> @@ -1532,6 +1663,7 @@
> return(rc);
> }
>
> +
> /* Send our local status out to the cluster */
> int
> send_local_status(void)
> @@ -1597,9 +1729,9 @@
>
> heartbeat_monitor(hmsg);
>
> - if (starting && nice_failback && hip != curnode) {
> + if ((CLUSTER && ME_STR) && nice_failback && hip != curnode) {
BUG? ^^ bug?
Perhaps you meant "&", not "&&"?

> ha_log(LOG_DEBUG, "I'm alone... ");
> - /* This is one of the place to put the
> + /* This is one of the places to put the
> * SIT_AND_CRY stuff */
> nice_failback = 0;
> req_our_resources();
> @@ -1746,7 +1878,13 @@
> int finalrc = HA_OK;
> int rc;
> int rsc_count = 0;
> + int rsc_held = 0;
> + char srsc_held[10];
>
> + if ((resources_held = ha_msg_new(0)) == NULL) {
> + ha_log(LOG_ERR, "no memory to create resource list");
> + return(HA_FAIL);
> + }
>
> ha_log(LOG_INFO, "Requesting our resources.");
> sprintf(cmd, HALIB "/ResourceManager listkeys %s", curnode->nodename);
> @@ -1786,6 +1924,11 @@
> ha_perror("%s returned %d", getcmd, rc);
> finalrc=HA_FAIL;
> }
> + if (ha_msg_add(resources_held, F_RES, buf) == HA_FAIL) {
> + ha_log(LOG_ERR, "no memory to list resource");
> + }
> + ++rsc_held;
> + ha_log(LOG_DEBUG, "I got resource: %s", buf);
> }
> }
>
> @@ -1793,7 +1936,12 @@
> cluster_already_active = 0;
> we_are_primary = 1;
> }
> -
> +
> + sprintf(srsc_held, "%d", rsc_held);
> + if (ha_msg_add(resources_held, F_NRES, srsc_held) == HA_FAIL) {
> + ha_log(LOG_ERR, "no memory to list resource");
> + }
> +
> rc=pclose(rkeys);
> if (rc < 0 && errno != ECHILD) {
> ha_perror("pclose(%s) returned %d", cmd, rc);
> diff -ruN /home/lclaudio/linux-ha/heartbeat/heartbeat.h linux-ha/heartbeat/heartbeat.h
> --- /home/lclaudio/linux-ha/heartbeat/heartbeat.h Wed Apr 12 20:03:49 2000
> +++ linux-ha/heartbeat/heartbeat.h Thu Apr 13 16:10:07 2000
> @@ -288,6 +288,7 @@
> extern void ha_log(int priority, const char * fmt, ...);
> extern void ha_perror(const char * fmt, ...);
> extern int send_local_starting(void);
> +extern int send_resources_held(void);
> extern int send_local_status(void);
> extern int set_local_status(const char * status);
> extern int send_cluster_msg(struct ha_msg*msg);


I hope this is of some use to you.

-dg

--
David Gould dgould@suse.com
If simplicity worked, the world would be overrun with insects.