Mailing List Archive: Antw: crmd (?) becomes unresponsive

Hi!

I cannot really help you, but it proved to be helpful to open a wide "fail -f /var/log/messages" window for every cluster node while issuing the actual commands in another window. Maybe you could also watch the cluster with hawk or crm_mon. My favourite option set is "-1Arf"...

Regards,
Ulrich

>>> Thomas Schulte <thomas@cupracer.de> schrieb am 22.01.2014 um 09:55 in Nachricht
<09d9d36ad571203a5b9b048da373df0e@ser4.de>:
> Hi all,
>
> I'm experiencing difficulties with my 2-node cluster and I'm running
> out of ideas about how to fix this. I'd be glad if someone here
> could point me to the right direction.
>
> As said, it's a 2-node cluster, running with openSUSE 13.1 and the
> HA-Factory packages:
>
> cluster-glue: 1.0.12-rc1 (b5f1605097857b8b96bd517282ab300e2ad7af99)
> resource-agents: # Build version:
> f725724964882a407f7f33a97124da07a2b28d5d
> CRM Version: 1.1.10+git20140117.a3cda76-102.1
> (1.1.10+git20140117.a3cda76)
> pacemaker 1.1.10+git20140117.a3cda76-102.1 -
> network:ha-clustering:Factory / openSUSE_13.1 x86_64
> libpacemaker3 1.1.10+git20140117.a3cda76-102.1 -
> network:ha-clustering:Factory / openSUSE_13.1 x86_64
> corosync 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1
> x86_64
> libcorosync4 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1
> x86_64
> resource-agents 3.9.5-63.1 - network:ha-clustering:Factory /
> openSUSE_13.1 x86_64
> cluster-glue 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory /
> openSUSE_13.1 x86_64
> libglue2 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory /
> openSUSE_13.1 x86_64
> ldirectord 3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1
> x86_64
>
> Both nodes have two NIC's, one is connected to the world and the other
> one
> connects both nodes with a crossover cable. An internal subnet is used
> here
> and my /etc/hosts files are fine:
>
> 127.0.0.1 localhost
> 10.0.0.1 s00201.ser4.de s00201
> 10.0.0.2 s00202.ser4.de s00202
>
> Corosync is configured with udpu and the firewall does not block any
> traffic between the interal NIC's.
>
>
> My cluster is up and running, both nodes are providing some services,
> filesystems are mirrored by drbd and the world is a happy place. :-)
> The cluster uses a valid and available DC. Editing and executing actions
> on resources is usually working fine.
>
> Sometimes, when I run a command like "crm resource migrate grp_nginx"
> or just a "crm resource cleanup pri_svc_varnish" it may happen that
> those
> commands don't return but timeout after a while. At this state even
> a "crmadmin -D" does not return.
>
> This happened a lot of times in the last days (I migrated to openSUSE
> 13.1 last week),
> so I tried different things to clear the problem, but nothing seems to
> work.
> I may happen that the STONITH mechanism is executed for one of the
> nodes.
> Interestingly the other node does not seem to recognize that it's alone
> then.
> "crm status" sometimes still shows both nodes as "online". In other
> cases
> it may occur that the second node comes up again after rebooting, but it
> doesn't
> get found by the first node and appears "offline".
> The network connection does not seem to have any problems.
> Communication is still possible between the nodes and I can see a lot of
> UDP traffic between both nodes.
>
> Most of the times I solve this by booting the "unresponsive" cluster
> node, too.
> This leads to other problems because my drbd devices become out of sync,
> services get stopped and so on. On the other hand, the node does not
> seem
> to heal itself, so no "crm" actions can successfully be executed.
>
> The last time that I dared to run a "crm" action was yesterday between
> 18 and 19 o'clock.
> I created a full hb_report that should contain all relevant information,
> including the pe-input files. I also enabled the debug logging mode for
> corosync,
> so extended logs are available, too.
>
> I used strace to find out what a simple "crmadmin -D" does. It ends
> with:
>
> ---------------
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> uname({sys="Linux", node="s00201", ...}) = 0
> futex(0x7fdccdcf2d48, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x7fdccf39f000
> socket(PF_LOCAL, SOCK_STREAM, 0) = 3
> fcntl(3, F_GETFD) = 0
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> connect(3, {sa_family=AF_LOCAL, sun_path=@"crmd"}, 110) = 0
> setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
> sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0",
> 24, MSG_NOSIGNAL, NULL, 0) = 24
> setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
> recvfrom(3, 0x7fff4bc37d10, 12328, 16640, 0, 0) = -1 EAGAIN (Resource
> temporarily unavailable)
> poll([{fd=3, events=POLLIN}], 1, 4294967295Process 3781 detached
> <detached ...>
> ---------------
>
> (The full log is available)
>
> crmadmin tries to reach crmd, so I also straced the running crmd
> process.
> There's not much happening here:
>
> ---------------
> Process 8669 attached
> read(22, Process 8669 detached
> <detached ...>
> ---------------
>
> I killed the crmd process and it got restarted automatically (by
> pacemakerd?).
> After that, strace just shows countless messages like these:
>
> ---------------
> Process 7856 attached
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21,
> events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22,
> revents=POLLHUP}])
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21,
> events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22,
> revents=POLLHUP}])
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21,
> events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22,
> revents=POLLHUP}])
> poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21,
> events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22,
> revents=POLLHUP}])
> ---------------
>
> Almost the same output is shown when strace'ing pacemakerd, cib and
> lrmd.
> Unfortunately, I can't remember what happens then, but I think that
> after restarting crmd
> it took some time until the other node was fenced. It came up again
> after rebooting,
> both nodes found each other, a DC was elected and everything was fine
> again.
>
> Another thing that I could not solve is this type of messages:
>
> ---------------
> pacemaker.service: Got notification message from PID 20309, but
> reception only permitted for PID 8663
> ---------------
>
> The PID's are:
>
> ---------------
> ps ax|egrep "20309|8663"
> 8663 ? Ss 0:05 /usr/sbin/pacemakerd -f
> 20309 ? Ss 0:29 /usr/sbin/httpd2 -DSTATUS -f
> /etc/apache2/httpd.conf -c PidFile /var/run//httpd2.pid
> ---------------
>
> Maybe this is not that important and has nothing to do with my described
> problem.
>
>
>
> I'd rather not attach the whole hb_report and logging data to this
> e-mail,
> but if someone would like to have a look at the files, I would send them
> directly.
>
> Maybe I missed to look at the right points to figure out what's going
> wrong.
> Any hint would be welcome. :-)
>
>
> Thanks for reading!
>
> Regards,
> Thomas
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems