Mailing List Archive

RFC: On rsyslog output modules and support for batch operations
Hello, world.

I discussed this in private with Rainer, and he suggested me to bring
the discussion here.

I'm already developing an output module for feeding an Oracle database
with rsyslog input. Rainer already committed some patches to the
"oracle" branch, in git. Let me remember that this is highly
experimental, and I'm sending a big semantic change today. But, in
principle, the module does what you'd expect from it: it connects to a
DB, receives a SQL statement via doAction, prepares that statement, runs
it, commits.

It works, but it's way too slow for my needs. As I said when I started
this project, I need to be very fast, to prepare the statement at
connection time, run it many times, and definitely want batch
operations. Say, I want to insert 1000 entries with a single call to the
Oracle interface, then commit.

With what I know now of rsyslog, I can do it more or less like this:

$OmoracleStatementTemplate,"insert into foo(field1, field2, field3)
values(:val1, :val2, :val3)"

which is the statement to prepare by Oracle. This way, I can prepare the
statement at createInstance() time. Then, I can specify the batch size
with something like

$OmoracleBatchSize 1000

With this, also at createInstance() time I can specify that doAction is
called only if there are 1000 entries pending for this selector, like
this:

CODE_STD_STRING_REQUESTparseSelectorAct(batch_size);

The bad part is that rsyslog will deliver to the output module a single
string per entry. So, I'd have to split each entry into its fields as
part of the doAction() code. I'd need some funny separator for each
field, to avoid problems. So far, it can be done. But the configuration
would look like this:

$OmoracleDB logdb
$OmoracleDBUser dbuser
$OmoracleDBPassword dbpassword
$OmoracleStatement "insert into foo(col1, col2) values (:fied1,
:field2)"
$OmoracleBatchSize 1000
$OmoracleFieldSeparator ****

*.* :omoracle:;"%field1%****%field2%"

and make doAction split the fields appropriately.

I bet it works. But it's probably too ugly for users. Cleaner ways may
need deeper changes into rsyslog's API so that the module gets direct
access to each field. That's probably a lot of work and I can't wait for
that.

So, my questions (at last!): Are there any other alternatives? Is this
"ugly" way of working good for other users? Should I keep it for
internal use?

Thanks a lot.
--
Luis Fernando Muñoz Mejías
Luis.Fernando.Munoz.Mejias@cern.ch

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com
Re: RFC: On rsyslog output modules and support for batch operations [ In reply to ]
On Wed, Apr 1, 2009 at 12:02 PM, Luis Fernando Muñoz Mejías
<Luis.Fernando.Munoz.Mejias@cern.ch> wrote:

> It works, but it's way too slow for my needs. As I said when I started
> this project, I need to be very fast, to prepare the statement at
> connection time, run it many times, and definitely want batch
> operations. Say, I want to insert 1000 entries with a single call to the
> Oracle interface, then commit.

..

> which is the statement to prepare by Oracle. This way, I can prepare the
> statement at createInstance() time. Then, I can specify the batch size
> with something like
>
> $OmoracleBatchSize 1000

Another thing you might want to think about is the idea of using a
callback timer, as was outlined for another prospective feature
implementation here: http://www.rsyslog.com/Article334.phtml

The general idea being, while having a batch size is important, if you
don't have some functional timer callback to the output module, you
will end up in the situation of not flushing regularly. On
lower-traffic outputs, this would reduce the risk of losing a lot of
data. So you could have two different mechanisms:

- A high-watermark batch commit
- A timed commit in the case that high-watermarks aren't met in a
certain time period.

That way you could commit every.. say, 60 seconds, in the case you
haven't hit your high watermark.

Just some food for thought.

-Aaron
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com
Re: RFC: On rsyslog output modules and support for batch operations [ In reply to ]
Aaron,

Thanks for your feedback.

> Another thing you might want to think about is the idea of using a
> callback timer, as was outlined for another prospective feature
> implementation here: http://www.rsyslog.com/Article334.phtml

I'd like to, indeed. But it's lower priority to me. The volume of the
data sources I'm pushing into Oracle is *really* high. Then, if I stick
to a single big batch, I can let the core do the actual batch
management. Otherwise, I need either to receive things in very small
batches that I concatenate or to call the core to dump its internal
buffer. Both are doable, but require some work I'd rather not do right
now. O:)

> The general idea being, while having a batch size is important, if you
> don't have some functional timer callback to the output module, you
> will end up in the situation of not flushing regularly. On
> lower-traffic outputs, this would reduce the risk of losing a lot of
> data. So you could have two different mechanisms:
>
What I'd suggest is to have several batch sizes for different selectors:

$OmoracleBatchSize 1000
if(large_volume_expression) then :omoracle:;LargeTemplateName

$OmoracleBatchSize 1
if(really_small_volume_expression) then :omoracle:;SmallTemplateName

This way, I don't need a timer to communicate with the core, and
simplify my code. In the worst case scenario, SSH could suddenly stop
working at the entire CERN and I'd lose the last 999 messages. I admit
the critical information on why SSH stopped working is on those 999
messages, but for the moment I accept that risk. ;)

Cheers.
--
Luis Fernando Muñoz Mejías
Luis.Fernando.Munoz.Mejias@cern.ch

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com
Re: RFC: On rsyslog output modules and support for batch operations [ In reply to ]
On Wed, Apr 1, 2009 at 10:02, Luis Fernando Muñoz Mejías
<Luis.Fernando.Munoz.Mejias@cern.ch> wrote:
> Hello, world.

Oi. :) Sorry I'm late to the game.

> It works, but it's way too slow for my needs. As I said when I started
> this project, I need to be very fast, to prepare the statement at
> connection time, run it many times, and definitely want batch
> operations. Say, I want to insert 1000 entries with a single call to the
> Oracle interface, then commit.

Forgive me - my database-performance-fu and oracle-fu are not terribly
strong, I may make a fool of myself here. What is the performance
gain of making a prepared statement over just executing raw
statements? IOW, why choose (please forgive my SQL):

CREATE PROCEDURE zazz AS
insert into foo(field1, field2, field3) values(:val1, :val2, :val3);
SET TRANSACTION;
zazz("foo", "bar", "baz");
zazz("foo1", "bar1", "baz1");
zazz("foo2", "bar2", "baz2");
COMMIT;

-- over

SET TRANSACTION;
INSERT INTO foo(field1, field2, field3) values("foo", "bar", "baz");
INSERT INTO foo(field1, field2, field3) values("foo1", "bar1", "baz1");
INSERT INTO foo(field1, field2, field3) values("foo2", "bar2", "baz2");
COMMIT;

Perhaps that's not even what you're doing. I know there are other
considerations and niceties with procedures, but the latter syntax
would still allow for batched transactions while enabling rsyslog to
do the dirty work of formatting the query and not necessitating
exposure of internal structures.

I confess to being a bit confused as to why the existing output module
interface wasn't readily extending to batching, since I've tended to
see the output modules as more of thin, final-hop proxies. IMHO,
database output modules should still pretty much blindly execute
whatever SQL rsyslog hands them, be that wrapped in a transaction or
not.

That said (and more a question for Rainer), do rsyslog templates have
support for a null character? If so, it may be a more viable approach
for delimiting simple fields than changing the output module API. Of
course the CSV approach works too, but seems easier to break out of
than null-delimiting.
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com
Re: RFC: On rsyslog output modules and support for batch operations [ In reply to ]
RB,

> Oi. :) Sorry I'm late to the game.

Your contribution is appreciated. :)

> Forgive me - my database-performance-fu and oracle-fu are not terribly
> strong, I may make a fool of myself here. What is the performance
> gain of making a prepared statement over just executing raw
> statements?

The statement is parsed only once, so you save the overhead of parsing
and doing an execution plan for each execution, which will be
identical. And I expect to insert hundreds of entries per second. :)

All you have to do is pass the arguments.

> CREATE PROCEDURE zazz AS
> insert into foo(field1, field2, field3) values(:val1, :val2,
> :val3); SET TRANSACTION; zazz("foo", "bar", "baz"); zazz("foo1",
> "bar1", "baz1"); zazz("foo2", "bar2", "baz2"); COMMIT;
>
> -- over
>
> SET TRANSACTION;
> INSERT INTO foo(field1, field2, field3) values("foo",
> "bar", "baz");
> INSERT INTO foo(field1, field2, field3) values("foo1",
> "bar1", "baz1");
> INSERT INTO foo(field1, field2, field3)
> values("foo2", "bar2", "baz2"); COMMIT;
>
With this code, Oracle (any DB, actually) needs to parse each insert,
and then choose the execution plan that looks best once.

What you get by preparing the statement and using batches is that the
client (rsyslog core) will store these triplets:

(foo, bar, baz) (foo1, bar1, baz1) (foo2, bar2, baz2)

and when you've hit a limit (say, you're on (foo1000, bar1000, baz1000))
send them all to the server at once (thus calling only once to doAction,
calling only once to the Oracle interface), who will blindly execute the
statement without wasting a single cycle on parsing or evaluating
execution plans: it's already done.

> Perhaps that's not even what you're doing.

For the moment I'm doing

BEGIN
INSERT INTO foo(field1, field2, field3) values("foo", "bar", "baz");
COMMIT;
BEGIN
INSERT INTO foo(field1, field2, field3) values("foo1", "bar1", "baz1");
COMMIT;

You can already imagine the overhead involved. Actually, all DB-based
modules on rsyslog do the same.

> I know there are other considerations and niceties with procedures,

It's not even a stored procedure, it's on the client doing communicating
many times versus only one with the DB.

> but the latter syntax would still allow for batched transactions while
> enabling rsyslog to do the dirty work of formatting the query and not
> necessitating exposure of internal structures.
>
Indeed, I want rsyslog doing most of the work for me. But the overhead
involved in parsing and evaluating execution plans is unacceptable on my
context. So I'm looking here for the balance between rsyslog doing work
for me and rsyslog performing as good as I need it. Perhaps exposing the
structures is not a good idea, either.

> IMHO, database output modules should still pretty much blindly execute
> whatever SQL rsyslog hands them, be that wrapped in a transaction or
> not.
>
Yes and no. Yes, rsyslog should be the one who tells the statement to be
executed. But there is no need for rsyslog to repeat that statement for
each entry (millions per day). Doing it at initialization time is
enough.

I made a small Python prototype to do something similar to what you
propose, with no batches, but committing each 1000 entries. The speedup
I got by introducing batches was about a factor 50. And the statement
was already prepared.

Cheers.
--
Luis Fernando Muñoz Mejías
Luis.Fernando.Munoz.Mejias@cern.ch

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com
Re: RFC: On rsyslog output modules and support for batch operations [ In reply to ]
> The statement is parsed only once, so you save the overhead of parsing
> and doing an execution plan for each execution, which will be
> identical. And I expect to insert hundreds of entries per second. :)

This is where more extensive database programming experience probably
helps; I was unaware of (but understand the reason for) the additional
overhead. Good to know.

> context. So I'm looking here for the balance between rsyslog doing work
> for me and rsyslog performing as good as I need it. Perhaps exposing the
> structures is not a good idea, either.

Perhaps you could [ab]use the fact that ppString is an array and do
something like ommail does, using more than one string/template when
using a custom subject. What I don't know off the top of my head is
whether this would limit the number of different Oracle outputs you
could connect to.


RB
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com