Mailing List Archive

protocol progress
Hi,

I've been testing the heartbeat protocol.

I tested it at a 90% packet loss rate for a few minutes, but it became clear
that I wasn't going to learn much useful there, because each machine was
constantly declaring the other dead. I was using a 10-second dead time.
Everything was behaving as it should; the results just didn't look very useful.
I did find and fix one minor bug.

So, I switched to a 50% packet loss rate. Now, each side declared the other
dead from time to time, but the links spent more time in the "live" state than
the "dead" state, so the results looked much more useful.
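To give a rough feel for why the two loss rates behave so differently, here is
a back-of-the-envelope sketch. It is not taken from the heartbeat code. It
assumes that packet losses are independent and that one heartbeat is sent per
second; neither assumption is stated above, and the function name is made up
for illustration. Only the 10-second dead time and the two loss rates come from
the test.

```python
def prob_window_all_lost(loss_rate: float, dead_time_s: float,
                         interval_s: float = 1.0) -> float:
    """Chance that every heartbeat in one dead-time window is lost.

    Assumes independent losses and a fixed heartbeat interval
    (interval_s is a hypothetical value, not taken from the test setup).
    """
    # Number of heartbeats the peer sends during one dead-time window.
    beats_per_window = max(1, round(dead_time_s / interval_s))
    # All of them must be lost for the peer to look dead for the whole window.
    return loss_rate ** beats_per_window


if __name__ == "__main__":
    for rate in (0.9, 0.5):
        p = prob_window_all_lost(rate, dead_time_s=10.0)
        print(f"loss rate {rate:.0%}: P(silent window) = {p:.6f}")
```

Under those assumptions a fully silent 10-second window is common at 90% loss
(0.9 to the 10th power, roughly one window in three) and rare at 50% loss (0.5
to the 10th power, about one in a thousand), which fits the picture of links
that are mostly "live" but still go "dead" from time to time.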

In spite of the significant packet loss rate, the protocol performed exactly
according to plan for the last 30 or so hours. Every packet eventually got
through. The "no packets missing" condition was detected about every two
seconds on average over the test.
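For anyone who wants a concrete picture of what a "no packets missing" check
could look like, here is a minimal sketch based on sequence numbers. It is an
illustration only, not the code in CVS: the class name, the use of sequence
numbers starting at 0, and the idea that gaps are filled by retransmitted
packets are all assumptions.

```python
class GapTracker:
    """Tracks which sequence numbers from one peer have not arrived yet."""

    def __init__(self) -> None:
        self.next_expected = 0      # lowest sequence number not yet seen or skipped
        self.missing: set[int] = set()

    def receive(self, seq: int) -> bool:
        """Record an arriving packet; return True if nothing is missing."""
        if seq >= self.next_expected:
            # Every number skipped on the way to seq is now a known gap.
            self.missing.update(range(self.next_expected, seq))
            self.next_expected = seq + 1
        else:
            # A late or retransmitted packet fills an earlier gap (if any).
            self.missing.discard(seq)
        return not self.missing


if __name__ == "__main__":
    tracker = GapTracker()
    for seq in (0, 2, 3, 1):
        clean = tracker.receive(seq)
        print(f"got {seq}: missing={sorted(tracker.missing)} clean={clean}")
```

In this sketch the receiver would ask the sender to resend whatever is in
`missing`, and the "no packets missing" condition is simply the moment that
set becomes empty again.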

The observant will note that this means I had a partitioned cluster (lots of
times!). At this point, that is expected.

The code that does all these wonderful things is now in CVS.

If someone has a chance to test it, let me know how it goes...


-- Alan Robertson
alanr@bell-labs.com