Mailing List Archive: Use of application-level acks in RELP.

Hi Ray,

thanks for your excellent questions. I've also made a blog post out of
them, as I think this needs some better visibility (and can be used for
future reference). In case you are curious:

http://blog.gerhards.net/2009/01/use-of-application-level-acks-in-relp.html

(no need to read, all answers are inline below)

On Wed, 2009-01-14 at 04:50 -0700, Ray Whitmer wrote:
> In my research of rsyslog to determine its suitability for a
> particular situation I have some questions left unanswered. I need
> relatively-guaranteed delivery. I will continue to review the
> available info including source code to see if I can answer the
> questions, but I hope it may be productive to ask questions here.
>
> In the documentation, you describe the situation where syslog silently
> loses tcp messages, not because the tcp protocol permits it but
> because the send function returns after delivering the message to a
> local buffer before it is actually delivered.
>
> But there is a more-fundamental reason an application-level ack is
> required. An application can fail (someone trips over the power cord)
> between when the application receives the data and when it records it.
>
> 1. Does rsyslog send the ack in the RELP protocol only after the
> message has been safely recorded in whatever queue has been configured
> or forwarded on so its delivery status is as safe as it will get (of
> course how safe depends upon options chosen), or was it only intended
> to solve the case of TCP buffering-based unreliability?

RELP is designed to provide end-to-end reliability. The TCP buffering
issue is just highlighted because it is so subtle that most people tend
to overlook it. An application abort seems to be more obvious and RELP
handles that.

HOWEVER, that does not mean messages are necessarily recorded when the
ACK is sent. It depends on the configuration. In RELP, the
acknowledgment is sent after the reception callback has been called.
This can be seen in the relevant RELP module. For rsyslog's imrelp, this
means the callback returns after the message has been enqueued in the
main message queue.

It now depends on how that queue is configured. By default, messages are
buffered in main memory. So when rsyslog aborts for some reason (or is
terminated by user request) before this message has been processed, it
is lost - while the sender still got a positive ACK. This is how things
are done by default, and it is useful for many scenarios. Of course, it
does not provide the audit-grade reliability that RELP aims for. But the
default config needs to take care of the usual use case and this is not
audit-grade reliability (just think of the numerous home systems that run
rsyslog and should do so in the least intrusive way).
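
To make the default behaviour described above concrete, here is a
minimal Python sketch. It is a toy model, not rsyslog or librelp code:
the Receiver class and its method names are made up for illustration.
It only shows the ordering "enqueue in memory, then ACK" and what an
abort does to messages that were acknowledged but not yet processed.

```python
from collections import deque


class Receiver:
    """Toy stand-in for a RELP receiver (hypothetical, not rsyslog code)."""

    def __init__(self):
        self.queue = deque()   # in-memory main queue (the default case)
        self.processed = []    # messages that reached their destination

    def on_receive(self, msg):
        # Reception callback: enqueue, then return. The ACK follows the
        # return, so it only confirms "enqueued", not "recorded".
        self.queue.append(msg)
        return "ACK"

    def process_one(self):
        # A worker takes one message off the queue and records it.
        self.processed.append(self.queue.popleft())

    def crash(self):
        # Abort (e.g. kill -9): whatever lives only in memory is gone.
        lost = list(self.queue)
        self.queue.clear()
        return lost


def demo_memory_queue():
    rx = Receiver()
    acks = [rx.on_receive(m) for m in ("m1", "m2", "m3")]
    rx.process_one()           # only m1 is recorded before the abort
    lost = rx.crash()
    return acks, rx.processed, lost
```

In this model the sender holds three positive ACKs, yet only the first
message was recorded; the other two were lost with the memory queue.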

If you are serious about your logs, you need to configure the engine to
be fully reliable. The most important thing is a good understanding of
the queue engine. You need to read and understand the rsyslog queue
( http://www.rsyslog.com/doc-queues.html ) docs, as they form the basis
on which reliability can be built.

The other thing you need to know is your exact requirements. Asking for
reliability is easy, implementing it is not. The closer you get to 100%
reliability (which you will never reach for one reason or the other) the
more complex scenarios get. I am sure the original poster knows quite
well what he wants, but I am often approached by people who just want to
have it "totally reliable" ... but don't want to spend the fortune it
requires (really - ever thought about the redundant data centers, power
plants, satellite and sea links et al. you need for that?). So it is
absolutely vital to have good requirements, which also include when
loss is acceptable, and at what cost this comes.

Once you have these requirements, an rsyslog configuration that matches
them can be designed.

At this point, I'd like to note that it may also be useful to consider
rsyslog professional services
( http://www.rsyslog.com/doc-professional_support.html ) as it provides
valuable aid during design and probably deployment of a solution (I
can't go into the full depth of enterprise requirements here).

To go back to the original question: RELP has almost everything that is
needed, but configuring the whole system in an audit-grade way requires
(ample) work.

> 2. Presumably there is a client API that speaks RELP. Can it be
> configured to return an error to the client if there is no ACK (i.e.
> if the log it sent did not make it into the configured safe location
> which could be on a disk-based queue), or does it only retry? Where is
> this API?

The API is in librelp ( http://www.librelp.com/ ). But actually this is
not what you are looking for. In rsyslog, an output module (here:
omrelp) provides the status back to the caller. Then, configuration
decides what happens. Messages may be discarded, sent to a different
destination or retried.

With omrelp, I think we have some hardcoded ways to preserve the
message, but I have not yet had time to look this up in detail. In any
case, RELP will not lose messages but may duplicate a few of them
(within the current unacked window) if the remote peer simply dies.
Again, this requires proper configuration of the rsyslog components.
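
The "no loss, but possible duplicates" property can be illustrated with
a small Python sketch. This is a toy model under stated assumptions, not
the librelp implementation: the Peer class, send_with_window() and the
acks_lost_from parameter are hypothetical, and the peer is assumed to
have stored each message durably before it died without getting its
last ACKs out.

```python
class Peer:
    """Toy receiver whose stored messages survive its death (assumed)."""

    def __init__(self):
        self.stored = []

    def receive(self, txnr, msg):
        self.stored.append((txnr, msg))


def send_with_window(messages, peer, acks_lost_from):
    """Send all messages; ACKs for txnr >= acks_lost_from never arrive."""
    unacked = {}
    for txnr, msg in enumerate(messages):
        peer.receive(txnr, msg)
        unacked[txnr] = msg
        if txnr < acks_lost_from:
            del unacked[txnr]   # ACK arrived, the message can be forgotten
    # After reconnecting, the sender cannot know what the peer stored,
    # so it resends everything still in the unacked window.
    for txnr, msg in sorted(unacked.items()):
        peer.receive(txnr, msg)
    return unacked
```

Sending four messages with the ACKs for the last two lost leaves all
four stored at the peer, with the last two stored twice. Resending the
window is what turns a potential loss into a duplicate.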

Even with that, you may lose messages if the local rsyslogd dies (not
terminates, but dies for some unexpected reason, e.g. a segfault, kill
-9 or whatever) but still has messages in a non-persisted queue. Again,
this can be mitigated by proper configuration, but that must be
designed. Also, it is very costly in terms of performance. A good
read on the subtleties can be found in the rsyslog mailing list archive
( http://lists.adiscon.net/pipermail/rsyslog/2008-October/001224.html ).
I suggest having a look at it.
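
As a rough illustration of why persisting the queue helps and why it is
costly, here is a Python sketch of a queue that forces every message to
disk before acknowledging it. This is not rsyslog's disk queue or its
file format; DiskQueue and its methods are hypothetical names. The one
sync per message is where the performance cost comes from.

```python
import os
import tempfile


class DiskQueue:
    """Toy append-only queue file (hypothetical, not rsyslog's format)."""

    def __init__(self, path):
        self.path = path

    def enqueue(self, msg):
        # Append one line and force it to stable storage before returning.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(msg + "\n")
            f.flush()
            os.fsync(f.fileno())   # the expensive part: one sync per message
        return "ACK"

    def recover(self):
        # After a restart, everything that was acknowledged is still on disk.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]


def demo_disk_queue():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "queue.log")
        q = DiskQueue(path)
        for m in ("m1", "m2", "m3"):
            q.enqueue(m)
        del q                      # simulate the process dying
        return DiskQueue(path).recover()
```

A fresh DiskQueue opened on the same file after the simulated death
still returns all three messages, because each ACK was only given after
the message had been synced.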
>
> Certainly the TCP caching case you mention in your pages is one a user
> is more likely to be able to reproduce, but that is all the more
> reason for me to be concerned that the less-reproducible situations
> that could cause a message to occasionally become lost are handled
> correctly.

I don't think app-abort is less reproducible:

kill -9 `cat /var/run/rsyslog.pid`

will do nicely. Actually, from feedback I received, many users seem to
understand the implications of a program/system abort. But far fewer
understand the issues inherent in TCP. Thus I am focusing so much on the
latter. But of course, everything needs to be considered. Read the thread
about the reliable queue (really!). It goes to great lengths, but still
does not offer a full solution. Getting things reliable (or secure) is
very, very challenging and requires in-depth knowledge.

So I am glad you asked and provided an opportunity for this to be
written :)

Rainer
