Mailing List Archive

abort cases
Hi,

just an update. While I have been silent, I am looking at the aborts
that were recently reported. I think they are related to threading
issues. I am reviewing some code and conducting some tests. This
unfortunately takes some time. I just wanted to let you know that
something is happening ;)

Interestingly, this seems to play nicely with the performance
optimization effort. It may even be that removing the cause could
improve performance, but it is too early to be sure it will... I have
not yet identified a sure cause.

I will keep you updated and post questions when I have them.

Thanks,
Rainer
abort cases [ In reply to ]
On Wed, 17 Sep 2008, Rainer Gerhards wrote:

> Hi,
>
> just an update. While I have been silent, I am looking at the aborts
> that were recently reported. I think they are related to threading
> issues. I am reviewing some code and conducting some tests. This
> unfortunately takes some time. I just wanted to let you know that
> something is happening ;)
>

In the meanwhile, I've compiled once more with --enable-valgrind and I'm
running the daemon under valgrind. Since with no other change the daemon
is still running, while it usually didn't last more than 12 hours, it
seems the timing variation does somewhat cure the symptoms.

>
> Interestingly, this seems to play nicely with the performance
> optimization effort. It may even be that removing the cause could
> improve performance, but it is too early to be sure it will... I have
> not yet identified a sure cause.
>
> I will keep you updated and post questions when I have them.
>

We are all waiting for you questions... in the hope we can be of any help!

Thank you, yours

lorenzo
abort cases [ In reply to ]
> > just an update. While I have been silent, I am looking at the aborts
> > that were recently reported. I think they are related to threading
> > issues. I am reviewing some code and conducting some tests. This
> > unfortunately takes some time. I just wanted to let you know that
> > something is happening ;)
> >
>
> In the meanwhile, I've compiled once more with --enable-valgrind and
> I'm
> running the daemon under valgrind. Since with no other change the
> daemon
> is still running, while it usually didn't last more than 12 hours, it
> seems the timing variation does somewhat cure the symptoms.

This is what I am thinking, too. I am running it under helgrind. If you
do that, you'll see a couple of warnings, most have been found to be
harmless by previous analysis. Anyhow, I am re-doing the analysis now
(what takes quite a bit of time). I am just telling you so that you
don't wonder when you see so many warnings. I think there is even a
discussion thread somewhere telling why this is. Mostly it is cases
where we do not really need sync... or so I thought ;)

Rainer
abort cases [ In reply to ]
>> is still running, while it usually didn't last more than 12 hours, it
>> seems the timing variation does somewhat cure the symptoms.
>
> This is what I am thinking, too. I am running it under helgrind. If you
> do that, you'll see a couple of warnings, most have been found to be
> harmless by previous analysis. Anyhow, I am re-doing the analysis now
> (what takes quite a bit of time). I am just telling you so that you
> don't wonder when you see so many warnings. I think there is even a
> discussion thread somewhere telling why this is. Mostly it is cases
> where we do not really need sync... or so I thought ;)
>

Just to help me understand, do you think val-grinding this time is
near-to-useless, and I'd better restart the daemon under helgrind, or you
prefer I continue valgrinding while you helgrind?

I feared uttering the words "race condition" would have been doing just
like the patient telling the doctor what he does want to hear as a
dyagnosis... now that I see you are looking for missing syncs I think we
share this gut feeling... Please let me know if I can help more, since
I'm somewhat in the hope this dreaded shared memory 8 way system could
very well shake the races... As a matter of fact, the destination server
is a twin-brother of the source one, and is running (and logging to
postgresql) without any hiccup since started-up.

Thank you very much, yours,

lorenzo
abort cases [ In reply to ]
On Wed, 2008-09-17 at 19:36 +0200, Lorenzo M. Catucci wrote:
> >> is still running, while it usually didn't last more than 12 hours, it
> >> seems the timing variation does somewhat cure the symptoms.
> >
> > This is what I am thinking, too. I am running it under helgrind. If you
> > do that, you'll see a couple of warnings, most have been found to be
> > harmless by previous analysis. Anyhow, I am re-doing the analysis now
> > (what takes quite a bit of time). I am just telling you so that you
> > don't wonder when you see so many warnings. I think there is even a
> > discussion thread somewhere telling why this is. Mostly it is cases
> > where we do not really need sync... or so I thought ;)
> >
>
> Just to help me understand, do you think val-grinding this time is
> near-to-useless, and I'd better restart the daemon under helgrind, or you
> prefer I continue valgrinding while you helgrind?

From a pragmatical POV: if running it under valgrind helps ease the
immediate problem, please do that. I suspect valgrind will not report
anything, but if it does, that would probably be very interesting.

Using valgrind's helgrind tool does not make sense to you, because it
emits lots of message, which need to be interpreted by someone with deep
knowledge of the code. This is what I am looking at.

> I feared uttering the words "race condition" would have been doing just
> like the patient telling the doctor what he does want to hear as a
> dyagnosis... now that I see you are looking for missing syncs I think we
> share this gut feeling...

Yes, definitely. There is also one not directly technical reason that
makes me believe this: we had almost no serious issue with rsyslog since
the highly parallel multi-threading engine was introduced. I was a bit
astonished, because doing such a complex beast absolutely correctly in
the first place - even with the lots of testing it received in my
environment - is something I only very seldomly hear about. Now, out of
the sudden, multiple bug reports, all pointing into the same direction
come in. I now conclude that many folks were hesitant to actually use
the new version and now that time has passed, deployments into high
demand environments begin. Unfortunately, I do not have funding for the
high end machines and the amount of time required to do all these high
end testing after the release was finished (and even if I had, I'd
probably still not seen one issue or the other). So it sounds somewhat
logical to me that we have now begun to actually do the firedrill for
the new highly parallel processing part. And that, of course, points
into issues with thread execution order, aka race conditions ;) This
direction would also explain why the issues did not come up earlier
(with reasoning given above).

> Please let me know if I can help more, since
> I'm somewhat in the hope this dreaded shared memory 8 way system could
> very well shake the races... As a matter of fact, the destination server
> is a twin-brother of the source one, and is running (and logging to
> postgresql) without any hiccup since started-up.

All you can do at this time is being patient. I am reviewing the code.
The worker thread pools and all this logic is highly complex. I must
make sure that I have again a very tight grip on it, review all the
subtle cases I came along and then look at the helgrind output and other
diagnostic sources to see where the issue is. Probably then it would be
useful to have a gdb backtrace of an aborted process (or the more the
better), but my experience with these kinds of problems is that good
analysis is more likely to solve them than any captured real-time data.

As soon as I have a question, I'll post. Should you notice something
that you find interesting, please post. Ignoring something is easy, not
knowing something that may potentially help would be bad ;)

Rainer
abort cases [ In reply to ]
> Please let me know if I can help more, since
> I'm somewhat in the hope this dreaded shared memory 8 way system could
> very well shake the races... As a matter of fact, the destination server
> is a twin-brother of the source one, and is running (and logging to
> postgresql) without any hiccup since started-up.

Oh, I forgot, there is one thing you could actually do: I do not intend
to do any threading changes to the current stable. This may sound silly,
because it looks like a bug, but I expect that a lot of changes need to
be done, especially to make helgrind happy (so that it can be used). The
amount of changes I expect are far too intrusive for the stable. After
all, the majority of folks do not experience the bug. So the "fix" will
become part of the current devel.

As such, it would be very useful for me if you could try out the latest
devel versions as I modify them. It would be really great if you could
pull them from my git archive at git.adiscon.com/git/rsyslog.git

The same to everyone else interested in this work ;)

Rainer
abort cases [ In reply to ]
Please also read this:

http://kb.monitorware.com/post13847.html#p13847

It has some other interesting findings which leads me to believe I have
really found the culprit (or at least one of them ;)).

Feedback appreciated.

Rainer

> -----Original Message-----
> From: rsyslog-bounces at lists.adiscon.com [mailto:rsyslog-
> bounces at lists.adiscon.com] On Behalf Of Rainer Gerhards
> Sent: Thursday, September 18, 2008 9:35 AM
> To: rsyslog-users
> Subject: Re: [rsyslog] abort cases
>
> > Please let me know if I can help more, since
> > I'm somewhat in the hope this dreaded shared memory 8 way system
> could
> > very well shake the races... As a matter of fact, the destination
> server
> > is a twin-brother of the source one, and is running (and logging to
> > postgresql) without any hiccup since started-up.
>
> Oh, I forgot, there is one thing you could actually do: I do not
intend
> to do any threading changes to the current stable. This may sound
> silly,
> because it looks like a bug, but I expect that a lot of changes need
to
> be done, especially to make helgrind happy (so that it can be used).
> The
> amount of changes I expect are far too intrusive for the stable. After
> all, the majority of folks do not experience the bug. So the "fix"
will
> become part of the current devel.
>
> As such, it would be very useful for me if you could try out the
latest
> devel versions as I modify them. It would be really great if you could
> pull them from my git archive at git.adiscon.com/git/rsyslog.git
>
> The same to everyone else interested in this work ;)
>
> Rainer
>
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
abort cases [ In reply to ]
I just tried compiling and running the helgrind branch. To let configure
finish its work, I've had to delete the PKG_CHECK_MODULES lines for RELP
and LIBLOGGING.

Worst thing is that the daemon crashed once more; I'm enclosing the whole
backtrace, that I've got via:

$ gdb /usr/sbin/rsyslogd /var/core-dumps/rsyslogd.core. > rsyslog_helgrind.bt << __eoc__
thread apply all bt
__eoc__

I don't know if the last line of failover.log could be of any help:

syslogd: msg.c:274: msgDestruct: Assertion `(unsigned) ((obj_t*)
(pThis))->iObjCooCKiE == (unsigned) 0xBADEFEE' failed.

Time for helgrind or valgrind?

Thank you, yours

lorenzo


+-------------------------+----------------------------------------------+
| Lorenzo M. Catucci | Centro di Calcolo e Documentazione |
| catucci at ccd.uniroma2.it | Universit? degli Studi di Roma "Tor Vergata" |
| | Via O. Raimondo 18 ** I-00173 ROMA ** ITALY |
| Tel. +39 06 7259 2255 | Fax. +39 06 7259 2125 |
+-------------------------+----------------------------------------------+
-------------- next part --------------
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libpthread.so.0...Reading symbols from /usr/lib/debug/lib/libpthread-2.7.so...done.
done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libdl.so.2...Reading symbols from /usr/lib/debug/lib/libdl-2.7.so...done.
done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/librt.so.1...Reading symbols from /usr/lib/debug/lib/librt-2.7.so...done.
done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.7.so...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.7.so...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib/rsyslog/lmnet.so...done.
Loaded symbols for /usr/lib/rsyslog/lmnet.so
Reading symbols from /lib/libnss_files.so.2...Reading symbols from /usr/lib/debug/lib/libnss_files-2.7.so...done.
done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /usr/lib/rsyslog/imuxsock.so...done.
Loaded symbols for /usr/lib/rsyslog/imuxsock.so
Reading symbols from /usr/lib/rsyslog/imklog.so...done.
Loaded symbols for /usr/lib/rsyslog/imklog.so
Reading symbols from /lib/libnss_compat.so.2...Reading symbols from /usr/lib/debug/lib/libnss_compat-2.7.so...done.
done.
Loaded symbols for /lib/libnss_compat.so.2
Reading symbols from /lib/libnsl.so.1...Reading symbols from /usr/lib/debug/lib/libnsl-2.7.so...done.
done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libnss_nis.so.2...Reading symbols from /usr/lib/debug/lib/libnss_nis-2.7.so...done.
done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /usr/lib/rsyslog/lmnetstrms.so...done.
Loaded symbols for /usr/lib/rsyslog/lmnetstrms.so
Reading symbols from /usr/lib/rsyslog/lmtcpclt.so...done.
Loaded symbols for /usr/lib/rsyslog/lmtcpclt.so
Reading symbols from /usr/lib/rsyslog/lmnsd_ptcp.so...done.
Loaded symbols for /usr/lib/rsyslog/lmnsd_ptcp.so
Core was generated by `rsyslogd -c3'.
Program terminated with signal 6, Aborted.
[New process 2747]
[New process 2749]
[New process 2748]
[New process 2746]
[New process 2745]
#0 0x00002ad35493eef5 in raise () from /lib/libc.so.6
(gdb)
Thread 5 (process 2745):
#0 0x00002ad3549d5f12 in select () from /lib/libc.so.6
#1 0x000000000040db33 in mainThread () at syslogd.c:2585
#2 0x000000000040ed0c in realMain (argc=<value optimized out>,
argv=<value optimized out>) at syslogd.c:3483
#3 0x00002ad35492b1a6 in __libc_start_main () from /lib/libc.so.6
#4 0x0000000000409f79 in _start ()

Thread 4 (process 2746):
#0 0x00002ad3549c4d17 in sched_yield () from /lib/libc.so.6
#1 0x0000000000431844 in wtiWorker (pThis=0x685200) at wti.c:371
#2 0x00000000004301da in wtpWorker (arg=0x685200) at wtp.c:419
#3 0x00002ad3542e9fc7 in start_thread () from /lib/libpthread.so.0
#4 0x00002ad3549dc7cd in clone () from /lib/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 3 (process 2748):
#0 0x00002ad3549d5f12 in select () from /lib/libc.so.6
#1 0x00002ad35507b9dd in runInput (pThrd=<value optimized out>)
at imuxsock.c:278
#2 0x0000000000441eef in thrdStarter (arg=0x6a6700) at ../threads.c:139
#3 0x00002ad3542e9fc7 in start_thread () from /lib/libpthread.so.0
#4 0x00002ad3549dc7cd in clone () from /lib/libc.so.6
#5 0x0000000000000000 in ?? ()

Thread 2 (process 2749):
#0 0x00002ad3542f07eb in read () from /lib/libpthread.so.0
#1 0x00002ad3552801bf in klogLogKMsg () at linux.c:449
#2 0x00002ad35527f554 in runInput (pThrd=0x6a7b90) at imklog.c:214
#3 0x0000000000441eef in thrdStarter (arg=0x6a7b90) at ../threads.c:139
#4 0x00002ad3542e9fc7 in start_thread () from /lib/libpthread.so.0
#5 0x00002ad3549dc7cd in clone () from /lib/libc.so.6
#6 0x0000000000000000 in ?? ()

Thread 1 (process 2747):
#0 0x00002ad35493eef5 in raise () from /lib/libc.so.6
#1 0x00002ad354940413 in abort () from /lib/libc.so.6
#2 0x0000000000422019 in sigsegvHdlr (signum=6) at debug.c:758
#3 <signal handler called>
#4 0x00002ad35493eef5 in raise () from /lib/libc.so.6
#5 0x00002ad354940413 in abort () from /lib/libc.so.6
#6 0x00002ad354937dc9 in __assert_fail () from /lib/libc.so.6
#7 0x000000000041bad6 in msgDestruct (ppThis=0x6905f8) at msg.c:274
#8 0x0000000000441824 in actionCallAction (pAction=0x690580, pMsg=0x6aeb70)
at ../action.c:741
#9 0x000000000040afb7 in processMsgDoActions (pData=0x690580,
pParam=0x41000e90) at syslogd.c:1117
#10 0x000000000041c938 in llExecFunc (pThis=0x6903e0,
pFunc=0x40af60 <processMsgDoActions>, pParam=0x41000e90)
at linkedlist.c:391
#11 0x000000000040ab19 in msgConsumer (notNeeded=<value optimized out>,
pUsr=0x6aeb70) at syslogd.c:1160
#12 0x000000000043ae97 in queueConsumerReg (pThis=0x690980, pWti=0x6a4600,
iCancelStateSave=<value optimized out>) at queue.c:1574
#13 0x0000000000431970 in wtiWorker (pThis=0x6a4600) at wti.c:413
#14 0x00000000004301da in wtpWorker (arg=0x6a4600) at wtp.c:419
#15 0x00002ad3542e9fc7 in start_thread () from /lib/libpthread.so.0
#16 0x00002ad3549dc7cd in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()
(gdb) quit
Re: abort cases [ In reply to ]
> I don't know if the last line of failover.log could be of any help:
>
> syslogd: msg.c:274: msgDestruct: Assertion `(unsigned) ((obj_t*)
> (pThis))->iObjCooCKiE == (unsigned) 0xBADEFEE' failed.
>
Interesting...

> Time for helgrind or valgrind?

Could you run it in debug mode (-dn) and send me the log file? Maybe it
has some interesting findings in it.

On valgrind and helgrind: both is fine. If the problem is reproducible,
first run it under the memory checker and then the thread checker.
Helgrind will report a couple (20 or so) exceptions at the END of
program run, that is fine. Any other exception - in either of both - is
interesting and probably helpful.

Rainer
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
Re: abort cases [ In reply to ]
OK, I (thankfully) was able to see a potential race in my test environment that actually fits very well to the assert message you see. In my environment, it happens very (very!) seldom, even helgrind didn't notice until now. Please wait until I create a patch for it, I do NOT need any further info in the meantime.

Rainer

> -----Original Message-----
> From: Lorenzo M. Catucci [mailto:lorenzo@sancho.ccd.uniroma2.it]
> Sent: Thursday, September 18, 2008 7:13 PM
> To: Rainer Gerhards
> Cc: rsyslog-users
> Subject: Re: [rsyslog] abort cases
>
> I just tried compiling and running the helgrind branch. To let
> configure
> finish its work, I've had to delete the PKG_CHECK_MODULES lines for
> RELP
> and LIBLOGGING.
>
> Worst thing is that the daemon crashed once more; I'm enclosing the
> whole
> backtrace, that I've got via:
>
> $ gdb /usr/sbin/rsyslogd /var/core-dumps/rsyslogd.core. >
> rsyslog_helgrind.bt << __eoc__
> thread apply all bt
> __eoc__
>
> I don't know if the last line of failover.log could be of any help:
>
> syslogd: msg.c:274: msgDestruct: Assertion `(unsigned) ((obj_t*)
> (pThis))->iObjCooCKiE == (unsigned) 0xBADEFEE' failed.
>
> Time for helgrind or valgrind?
>
> Thank you, yours
>
> lorenzo
>
>
> +-------------------------+--------------------------------------------
> --+
> | Lorenzo M. Catucci | Centro di Calcolo e Documentazione
> |
> | catucci@ccd.uniroma2.it | Università degli Studi di Roma "Tor
> Vergata" |
> | | Via O. Raimondo 18 ** I-00173 ROMA **
> ITALY |
> | Tel. +39 06 7259 2255 | Fax. +39 06 7259 2125
> |
> +-------------------------+--------------------------------------------
> --+
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
Re: abort cases [ In reply to ]
Lorenzo,

I would appreciate if you could run the latest code from "helgrind" branch. I have created one patch. It is probably not a 100% solution, but I won't manage to do much more today. And so it would be very helpful for me to know if something changes in your environment.

Thanks,
Rainer

> -----Original Message-----
> From: rsyslog-bounces@lists.adiscon.com [mailto:rsyslog-
> bounces@lists.adiscon.com] On Behalf Of Rainer Gerhards
> Sent: Friday, September 19, 2008 4:00 PM
> To: Lorenzo M. Catucci
> Cc: rsyslog-users
> Subject: Re: [rsyslog] abort cases
>
> OK, I (thankfully) was able to see a potential race in my test
> environment that actually fits very well to the assert message you see.
> In my environment, it happens very (very!) seldom, even helgrind didn't
> notice until now. Please wait until I create a patch for it, I do NOT
> need any further info in the meantime.
>
> Rainer
>
> > -----Original Message-----
> > From: Lorenzo M. Catucci [mailto:lorenzo@sancho.ccd.uniroma2.it]
> > Sent: Thursday, September 18, 2008 7:13 PM
> > To: Rainer Gerhards
> > Cc: rsyslog-users
> > Subject: Re: [rsyslog] abort cases
> >
> > I just tried compiling and running the helgrind branch. To let
> > configure
> > finish its work, I've had to delete the PKG_CHECK_MODULES lines for
> > RELP
> > and LIBLOGGING.
> >
> > Worst thing is that the daemon crashed once more; I'm enclosing the
> > whole
> > backtrace, that I've got via:
> >
> > $ gdb /usr/sbin/rsyslogd /var/core-dumps/rsyslogd.core. >
> > rsyslog_helgrind.bt << __eoc__
> > thread apply all bt
> > __eoc__
> >
> > I don't know if the last line of failover.log could be of any help:
> >
> > syslogd: msg.c:274: msgDestruct: Assertion `(unsigned) ((obj_t*)
> > (pThis))->iObjCooCKiE == (unsigned) 0xBADEFEE' failed.
> >
> > Time for helgrind or valgrind?
> >
> > Thank you, yours
> >
> > lorenzo
> >
> >
> > +-------------------------+------------------------------------------
> --
> > --+
> > | Lorenzo M. Catucci | Centro di Calcolo e Documentazione
> > |
> > | catucci@ccd.uniroma2.it | Università degli Studi di Roma "Tor
> > Vergata" |
> > | | Via O. Raimondo 18 ** I-00173 ROMA **
> > ITALY |
> > | Tel. +39 06 7259 2255 | Fax. +39 06 7259 2125
> > |
> > +-------------------------+------------------------------------------
> --
> > --+
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
Re: abort cases [ In reply to ]
As promised, I've just logged in to the server, to check if the daemon was
still on. I found a 33M core dump; I'm enclosing the backtrace here.

Have a nice week-end,

lorenzo

+-------------------------+----------------------------------------------+
| Lorenzo M. Catucci | Centro di Calcolo e Documentazione |
| catucci@ccd.uniroma2.it | Università degli Studi di Roma "Tor Vergata" |
| | Via O. Raimondo 18 ** I-00173 ROMA ** ITALY |
| Tel. +39 06 7259 2255 | Fax. +39 06 7259 2125 |
+-------------------------+----------------------------------------------+