Mailing List Archive: crmd (?) becomes unresponsive

Hi all,

I'm experiencing difficulties with my 2-node cluster and I'm running
out of ideas about how to fix this. I'd be glad if someone here
could point me in the right direction.

As said, it's a 2-node cluster, running with openSUSE 13.1 and the
HA-Factory packages:

cluster-glue: 1.0.12-rc1 (b5f1605097857b8b96bd517282ab300e2ad7af99)
resource-agents: # Build version: f725724964882a407f7f33a97124da07a2b28d5d
CRM Version: 1.1.10+git20140117.a3cda76-102.1 (1.1.10+git20140117.a3cda76)
pacemaker 1.1.10+git20140117.a3cda76-102.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
libpacemaker3 1.1.10+git20140117.a3cda76-102.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
corosync 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
libcorosync4 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
resource-agents 3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
cluster-glue 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
libglue2 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
ldirectord 3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64

Both nodes have two NICs: one is connected to the world and the other
connects the two nodes with a crossover cable. An internal subnet is
used here and my /etc/hosts files are fine:

127.0.0.1 localhost
10.0.0.1 s00201.ser4.de s00201
10.0.0.2 s00202.ser4.de s00202

Corosync is configured with udpu and the firewall does not block any
traffic between the internal NICs.

My cluster is up and running, both nodes are providing some services,
filesystems are mirrored by drbd and the world is a happy place. :-)
The cluster uses a valid and available DC. Editing resources and
executing actions on them usually works fine.

Sometimes, when I run a command like "crm resource migrate grp_nginx"
or just a "crm resource cleanup pri_svc_varnish", the command does not
return but times out after a while. In this state even a
"crmadmin -D" does not return.
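
To tell a hung query apart from a merely slow one, the call can be
wrapped in a hard timeout. The following is only a sketch: the helper
name `probe` and the 10-second limit are assumptions made for
illustration, and the only cluster command it uses is the
"crmadmin -D" call mentioned above.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and return its stdout, or None if it did not finish in time."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing lingers.
        return None
    return done.stdout


if __name__ == "__main__":
    # Hypothetical usage on a cluster node; 10 s is an arbitrary choice.
    out = probe(["crmadmin", "-D"])
    print("crmd did not answer in time" if out is None else out.strip())
```

A `None` result corresponds to the state described above, in which the
command never returns on its own.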

This has happened many times in the last few days (I migrated to
openSUSE 13.1 last week), so I tried different things to clear the
problem, but nothing seems to work. It may happen that the STONITH
mechanism is executed for one of the nodes. Interestingly, the other
node does not seem to recognize that it is alone then: "crm status"
sometimes still shows both nodes as "online". In other cases the
second node comes up again after rebooting, but it is not found by the
first node and appears "offline". The network connection does not seem
to have any problems. Communication is still possible between the
nodes and I can see a lot of UDP traffic between both nodes.

Most of the time I solve this by rebooting the "unresponsive" cluster
node, too. This leads to other problems because my drbd devices get
out of sync, services get stopped and so on. On the other hand, the
node does not seem to heal itself, so no "crm" action can be executed
successfully.

The last time I dared to run a "crm" action was yesterday between
18:00 and 19:00. I created a full hb_report that should contain all
relevant information, including the pe-input files. I also enabled
debug logging for corosync, so extended logs are available, too.

I used strace to find out what a simple "crmadmin -D" does. It ends
with:

---------------
uname({sys="Linux", node="s00201", ...}) = 0
uname({sys="Linux", node="s00201", ...}) = 0
uname({sys="Linux", node="s00201", ...}) = 0
uname({sys="Linux", node="s00201", ...}) = 0
uname({sys="Linux", node="s00201", ...}) = 0
futex(0x7fdccdcf2d48, FUTEX_WAKE_PRIVATE, 2147483647) = 0
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdccf39f000
socket(PF_LOCAL, SOCK_STREAM, 0) = 3
fcntl(3, F_GETFD) = 0
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
connect(3, {sa_family=AF_LOCAL, sun_path=@"crmd"}, 110) = 0
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
recvfrom(3, 0x7fff4bc37d10, 12328, 16640, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 4294967295Process 3781 detached
<detached ...>
---------------

(The full log is available)
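
Some of the raw numbers in the trace are easier to read once decoded.
A small sketch, assuming a Linux system. Reading the 24-byte payload as
three 32-bit fields, each padded to 8 bytes, is only a guess at the IPC
header layout and has not been checked against the library source; the
variable names are made up for illustration.

```python
import socket
import struct

# poll() timeout as printed by strace: the unsigned view of a signed 32-bit int.
timeout = struct.unpack("<i", struct.pack("<I", 4294967295))[0]

# recvfrom() flags as printed by strace (16640 == 0x4100).
flags = 16640
names = [
    n for n in ("MSG_WAITALL", "MSG_NOSIGNAL")
    if flags & getattr(socket, n, 0)  # constant may be missing off Linux
]

# The 24 bytes written by sendto(), copied from the trace.
payload = b"\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0"

# ASSUMED layout: int32, int32, uint32, each in an 8-byte slot, little-endian.
msg_id, size, max_size = struct.unpack("<i4xi4xI4x", payload)

print(timeout)  # -1, i.e. poll() waits with no timeout at all
print(names)
print(msg_id, size, max_size)
```

Under that assumed layout the second field equals the payload length
(24), which is at least consistent with it being a size field. The -1
timeout means crmadmin blocks until crmd answers, which matches the
hang.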

crmadmin tries to reach crmd, so I also straced the running crmd
process. There's not much happening here:

---------------
Process 8669 attached
read(22, Process 8669 detached
<detached ...>
---------------

I killed the crmd process and it got restarted automatically (by
pacemakerd?). After that, strace just shows countless messages like
these:

---------------
Process 7856 attached
poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
---------------
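
For reference, POLLHUP is what poll() reports once the other end of a
connection has gone away, even if only POLLIN was requested. That
would fit the repeated result above, where the same two descriptors
are flagged on every call. A minimal sketch of that behaviour with a
local socket pair follows; it is a generic illustration assuming
Linux, not a reproduction of what crmd does internally, and the
function name is made up.

```python
import select
import socket


def revents_after_peer_close():
    """Poll one end of a socket pair after the other end was closed."""
    ours, theirs = socket.socketpair()  # connected AF_UNIX stream sockets
    try:
        theirs.close()  # simulate the peer going away
        poller = select.poll()
        poller.register(ours, select.POLLIN)  # only POLLIN is requested
        events = poller.poll(3000)  # same 3000 ms timeout as in the trace
        return events[0][1] if events else 0
    finally:
        ours.close()


if __name__ == "__main__":
    flags = revents_after_peer_close()
    print("POLLHUP set:", bool(flags & select.POLLHUP))
```

As long as such a descriptor stays registered, every further poll()
call reports the hang-up again.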

Almost the same output is shown when strace'ing pacemakerd, cib and
lrmd. Unfortunately, I can't remember exactly what happened next, but
I think that after restarting crmd it took some time until the other
node was fenced. It came up again after rebooting, both nodes found
each other, a DC was elected and everything was fine again.

Another thing that I could not solve is this type of message:

---------------
pacemaker.service: Got notification message from PID 20309, but reception only permitted for PID 8663
---------------

The PIDs are:

---------------
ps ax|egrep "20309|8663"
8663 ? Ss 0:05 /usr/sbin/pacemakerd -f
20309 ? Ss 0:29 /usr/sbin/httpd2 -DSTATUS -f /etc/apache2/httpd.conf -c PidFile /var/run//httpd2.pid
---------------

Maybe this is not that important and has nothing to do with the
problem I described.

I'd rather not attach the whole hb_report and logging data to this
e-mail, but if someone would like to have a look at the files, I would
send them directly.

Maybe I have not been looking in the right places to figure out what's
going wrong. Any hint would be welcome. :-)

Thanks for reading!

Regards,
Thomas
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems