Mailing List Archive

Re: comms layering
Hi,

On Mon, 31 Jan 2000 14:37:24 -0500, Keith Barrett <kbarrett@redhat.com>
said:

> That was my thought. I think this is a good discussion topic, and I
> know I need help in either understanding why we won't use messaging
> (because we may also get asked by outsiders), or how clustering may
> future impact a personal effort on my part to provide a Message
> Queuing environment for Linux. I honestly don't understand why we
> would not want to layer ourselves from the commuications system. I
> also got the impression that one or two people in our Clustering
> symposium agreed.

> I will, of course, live with whatever Clustering architecture is
> designed.


>> Umm, sockets *are* the Unix transport-independent messaging layer.

> Forgive my ignorance here (I honestly don't know), but do socket calls
> support native ATM?

Yes, they do.

> I know they don't support direct ethernet without IP

Yes, they do, but you need to have root privileges to use SOCK_RAW (as
there are serious security implications in giving normal users raw
access to the wire!).

> (which might have offered some performance opportunities). Do future
> protocols always get included in the socket APIs?

The socket API is generic. You can define any protocol you want by
specifying a new address family and new protocol family constant on the
local system.
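
To illustrate: the calling sequence below is identical no matter which
family constant gets plugged in; only the sockaddr type changes. (A
sketch only --- swap AF_INET/sockaddr_in for AF_UNIX/sockaddr_un, or
for some new family registered on the local system, and the rest is
untouched.)

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);  /* family constant here */
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family      = AF_INET;
        sin.sin_port        = htons(7);            /* echo service, say */
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        /* connect()/read()/write() are family-agnostic from here on. */
        if (connect(fd, (struct sockaddr *) &sin, sizeof(sin)) < 0)
            perror("connect");
        return 0;
    }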

> I know from personal experience that socket calls do not work
> universally the same when you cross multiple platforms. For example;
> Some platforms (Motorola or AS400 if I remember correctly) did not let
> you create a pipe larger than 1k bytes, forcing you to deal with your
> own frame dissection and reassembly (most other platforms allow a pipe
> at least 32k long).

I'm sold on the advantages of messaging. However, IP sockets _are_ a
lowest common denominator. The question that bothers me is, do we want
to provide messaging eventually as an API for clustered applications,
or do we want
to provide it now and build the whole cluster infrastructure on top of
it?

The problem is that if we go the latter route, then we can code and test
_nothing_ until the message queues are in place. Using message queues
as infrastructure has its own set of problems, too --- cluster service
daemons which have to participate in recovery necessarily have to have
very different comms behaviour over a cluster transition than cluster
applications running above that level. A lock manager will need to
reset all of its cluster interconnections when recovery occurs. An
application using message queuing, on the other hand, should simply not
notice a cluster transition if it doesn't involve the loss of any node
it is communicating with.

> Some platforms block listeners, others support multiple listen
> connects. Also; not all systems support sockets (some use
> streams).

Streams-based systems almost always provide a socket API in addition.

> There's also an ad-hoc way of using messaging and a proxy Linux
> system to immediately bring incompatible systems (like WinNT) into
> the cluster until someone codes native services for it. This has
> many political advantages.

Windows supports sockets.

> But on their own, these may not be enough of a case for a messaging
> subsystem in clustering (at least, not initially). It does become more
> important when you want non-linux (and especially non-unix) systems
> to join clustering. This is the reason I brought this up. Even DEC,
> when it was looking into non-VMS cluster memberships, was considering
> messaging. If we will be needing a messaging layer eventually,
> perhaps its API should be designed and coded against now to save
> rework?

Right --- it's getting the layering right which matters at the moment.
So, here are some of the questions I think we need to settle before we
can make that judgement:

How much work is involved in implementing enough MQ to get the cluster
services working? How much performance will be lost in its
implementation? How easily can we settle the API? Given that the
reality is that all our expected initial adopters of clustering will be
running exclusively IP, how transparent can we make the message queuing
to them? (We do NOT want to expose a new addressing scheme to IP
users!)

> Do we really trust that sockets are a good universal communications
> layer?

Yes. Furthermore, in the Unix world, socket support is simply required:
we _have_ to support it so that existing applications can use clustering
easily.

> Do we trust that all future protocols will be sockets enabled?

Yes, although that does not mean that sockets will always be the most
efficient way of accessing future protocols. (But the same is true of
MQ: if you need every last bit of performance over a VIA interconnect,
you'll talk raw VIA in either case).

The question is not whether or not sockets are a good universal layer.
They _are_ pretty much universal, but they are also very much a lowest
common denominator.

The question which matters _right now_ is really how much infrastructure
we build on top of sockets for our initial core cluster services. That
initial core will necessarily be built on top of sockets, but we can add
as much abstraction there as we want in order to make things more
portable for the future.

My own feeling is that the abstraction we need to provide has to include
clean large message support and transparent connection opening. The
semantics _should_ be that we can pick a cluster member and send any
message, treating the messaging layer as a reliable ordered datagram
service but using tcp underneath. A cluster transition should be
visible just as a barrier reset in that layer: connection management
should be abstracted.

That is what is _required_ for the cluster services v1. What we eventually
want to offer to applications in our full cluster infrastructure
services may be very different.

--Stephen
Re: comms layering
Hi,

On Mon, 31 Jan 2000 11:13:35 -0800, David Brower <dbrower@us.oracle.com>
said:

> "Stephen C. Tweedie" wrote:
>> ... Sockets are what all existing support
>> libraries, such as xdr/rpc, run on top of. As such, I'd need an
>> overwhelming reason not to use that as the primary communications
> ^^^^^^^
>> architecture for clustering.

> Agreed, with emphasis on "primary"; existence of primary need not
> exclude other, "secondary" methods.

Absolutely. Definitely.

>> Umm, sockets *are* the Unix transport-independent messaging layer. What
>> I've already suggested is that we have a name service which lets us open
>> or connect to a socket by the cluster node ID, rather than by address.
>> That already gives us full protocol independence. We can run over IP,
>> IPX, VIA or whatever if we have that in place, and we don't need to
> ^^^^
>> throw away the socket infrastructure to get it.

> I'll get off the boat with Stephen on this point, though. VIA is one
> of the transports I'd strongly consider as a "secondary" interface. This
> is because its strengths are not well-revealed through a socket interface,
> and really cry for other mechanisms to exploit well. If you're willing
> to put up with the socket overhead, the cost of running IP on it is trivial,
> so you might as well just make it another IP stack at that point.

That's not my point at all. If we have a VIA transport, we still want
our cluster services to be able to run, but we also want the
applications to be able to use fast messaging for their own purposes.

We don't need the cluster heartbeats to run on native VIA, for example:
running that over sockets over VIA would be fine. The point is that we
still want our core cluster services to continue to run on VIA even if
they haven't been optimised for it, while still letting other
applications make use of the full power of native VIA APIs if they want
it.
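
As a concrete illustration of the name-service idea quoted above, the
cluster code could do something like the following sketch. The lookup
routine and its name are hypothetical; everything after it is plain,
transport-independent socket code:

    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Hypothetical name service: fills in whatever sockaddr the locally
     * configured transport uses for this node (IP, IPX, IP over VIA,
     * ...) and returns its length, or -1 if the node is unknown. */
    extern int cl_node_lookup(int nodeid, struct sockaddr_storage *sa);

    int connect_to_node(int nodeid)
    {
        struct sockaddr_storage sa;
        int salen = cl_node_lookup(nodeid, &sa);
        if (salen < 0)
            return -1;

        /* The caller never handles addresses or family constants, so
         * the same code runs over any transport the kernel supports. */
        int fd = socket(sa.ss_family, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *) &sa, salen) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }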

--Stephen
Re: comms layering
Hi,

On Mon, 31 Jan 2000 17:55:25 -0500, Keith Barrett <kbarrett@redhat.com>
said:

> Have you already figured out how this is going to look? I'm not
> proposing that message queuing be created and hold up the project. I'm
> suggesting that an API be created that meets both situations (if they
> are indeed very similar), and that the initial coding under that API
> focus on clustering.

Once I start looking more closely at the mid-layers of the clustering
services, I'll try to codify an API for what I think we need to offer as
a comms library for recoverable services.

The low-level cluster stuff, though, has very specific communications
needs which can't be met by a general-purpose library. For example, the
design calls for the ability of the low-level code to use "degraded"
links such as serial connections, as fallbacks in case the primary
cluster transport dies, allowing us to gracefully evict one failed node
from the cluster. There are also hard timing guarantees which need to
be provided at the low level. Finally, the service interconnect library
will need to be aware of cluster transitions and so will probably have
to be layered on top of the cluster integration/membership layer, so it
will be hard to use that same library to implement cluster membership
in the first place. :)
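
Purely to illustrate the sort of thing the low-level code needs (none
of this is designed yet, and every name below is invented), a
prioritised table of links that the heartbeat code could fall back
through when the primary transport dies might look like this:

    /* Illustrative sketch only -- not part of any agreed design. */

    #include <stddef.h>

    enum cl_link_kind { CL_LINK_IP, CL_LINK_SERIAL };

    struct cl_link {
        enum cl_link_kind kind;
        const char       *name;      /* "eth0", "/dev/ttyS0", ... */
        int               priority;  /* lower value = preferred */
        int               usable;    /* maintained by the heartbeat code */
    };

    /* Pick the best currently usable link; the degraded serial line is
     * only chosen once every higher-priority link has been marked dead. */
    static struct cl_link *cl_pick_link(struct cl_link *links, int n)
    {
        struct cl_link *best = NULL;
        int i;

        for (i = 0; i < n; i++)
            if (links[i].usable &&
                (best == NULL || links[i].priority < best->priority))
                best = &links[i];
        return best;
    }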

All that means that there are a few components of the stack to be filled
out before I'll be at the stage where I'll need to finalise the service
comms lib.

> So much of what you need appears to be standard practice in messaging,
> and parts are not provided for in sockets. This is what I see as your
> core:

> 1. Independent addressing from network protocols, and local in
> appearance
> to cluster nodes and applications (regardless of subnet). In other
> words;
> you want to use them as nodes A, B, and C -- regardless of whether
> they
> are members of the same subnet or not (in which case, UDP or IP
> broadcasts
> will have to be simulated).

[Umm, please try to restrict your mailer to 80 columns, it makes things
much easier for us dinosaurs who still insist on using text-mode tools
to read email!]

A complete broadcast/multicast infrastructure feels a bit like overkill
for the low-level cluster service interconnect. However, it's probably
worth bearing this in mind when the API is established, so that it is
easy to add multicast later on.

Certain services spring to mind which might be able to benefit from it
--- in particular, things like telling a set of nodes to drop locks on a
given resource might be multicast. My feeling right now is that it
really isn't useful enough at this level in the stack to justify
holding up implementation of the interconnect API.
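
For illustration only, a minimal sketch of what that multicast usage
could look like later on, assuming plain IPv4 multicast (the group
address and port here are made up):

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family      = AF_INET;
        sin.sin_port        = htons(5405);          /* made-up port */
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(fd, (struct sockaddr *) &sin, sizeof(sin)) < 0) {
            perror("bind");
            return 1;
        }

        /* Join a (made-up) cluster group; a "drop locks on resource X"
         * notification could then be sent once and seen by every node
         * that joined. */
        struct ip_mreq mreq;
        mreq.imr_multiaddr.s_addr = inet_addr("239.1.2.3");
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                       &mreq, sizeof(mreq)) < 0) {
            perror("IP_ADD_MEMBERSHIP");
            return 1;
        }

        char buf[1500];
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n >= 0)
            printf("got %d-byte notification\n", (int) n);
        return 0;
    }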

> 9. Hidden message defrag and reassembly
> 12. Sequencing and ordered message delivery

That is most definitely going to be necessary for the service comms
lib. The other bits here seem to be much more appropriate to high-level
application services, not to a message infrastructure to be used within
the cluster daemons themselves.

--Stephen
Re: comms layering
"Stephen C. Tweedie" wrote:
>
> > they
> > are members of the same subnet or not (in which case, UDP or IP
> > broadcasts
> > will have to be simulated).
>
> [Umm, please try to restrict your mailer to 80 columns, it makes things
> much easier for us dinosaurs who still insist on using text-mode tools
> to read email!]

Actually, that was Netscape breaking apart my message before it went out,
without telling me where my margin was :-/

--

Keith