Hi,
I have a problem that looks similar to the one reported as "possible deadlock
in lrmd" on 21st Oct.
When running lrmadmin -C to list classes, the first time it comes back
immediately with the expected list, e.g.
root@node1:/home# lrmadmin -C
There are 5 RA classes supported:
lsb
ocf
stonith
upstart
heartbeat
On all subsequent attempts lrmadmin hangs and never comes back (you have to
kill it with Ctrl-C). This is repeatable on all the machines I have tried it on.
A reboot appears to be the only cure, as "corosync stop" baulks on
Waiting for corosync services to unload:.........
The environment is compiled from source (as per
http://www.clusterlabs.org/wiki/Install#From_Source) on a clean Maverick server (both 32- and 64-bit).
Versions are:
Cluster-Resource-Agents-051972b5cfd
Pacemaker-1-0-b2e39d318fda
Reusable-Cluster-Components-8658bcdd4511
corosync flatiron - not sure of the exact revision, but downloaded Friday 19th
Strace suggests that lrmadmin is stuck on
/var/run/heartbeat/lrm_cmd_sock, polling with "Resource temporarily
unavailable"; lrmd never responds to the outbound message:
17:43:41.328500 connect(3, {sa_family=AF_FILE,
path="/var/run/heartbeat/lrm_cmd_sock"}, 110) = 0
17:43:41.328572 getsockopt(3, SOL_SOCKET, SO_PEERCRED,
"\t\4\0\0\0\0\0\0\0\0\0\0", [12]) = 0
17:43:41.328788 getegid() = 0
17:43:41.328846 getuid() = 0
17:43:41.328970 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329050 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329154 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329202 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329263 sendto(3,
"F\0\0\0\315\253\0\0>>>\nlrm_t=reg\nlrm_app=lr"..., 78,
MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 78
17:43:41.329337 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329380 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329420 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329458 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329497 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329535 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329574 recvfrom(3, 0x17f1e70, 4048, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
17:43:41.329613 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
17:43:41.329651 poll([{fd=3, events=POLLIN}], 1, -1 <unfinished ...>
The lrmd process is still alive, and nothing is logged in
/var/log/daemon.log. Its strace implies it never even saw the request on
the socket. The process still has 3 file handles open on the socket:
root@node1:~# lsof /var/run/heartbeat/lrm_cmd_sock
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
lrmd 1420 root 3u unix 0xffff88001e011040 0t0 8732
/var/run/heartbeat/lrm_cmd_sock
lrmd 1420 root 9u unix 0xffff88001e0b4d00 0t0 8782
/var/run/heartbeat/lrm_cmd_sock
lrmd 1420 root 11u unix 0xffff88001e1a9d40 0t0 10211
/var/run/heartbeat/lrm_cmd_sock
A good strace (i.e. lrmadmin -C after a reboot) starts identically to the
strace above but receives a response from lrmd:
...
20:12:48.774239 poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3,
revents=POLLIN}])
20:12:48.774603 recvfrom(3, "
\0\0\0\315\253\0\0>>>\nlrm_t=return\nlrm_ret"..., 4048, MSG_DONTWAIT,
NULL, NULL) = 40
20:12:48.774661 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.774709 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.774756 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.774804 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.774851 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.774898 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.774945 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.775161 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.775210 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.775257 recvfrom(3, 0x1049e98, 4008, 64, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable)
20:12:48.775304 poll([{fd=3, events=0}], 1, 0) = 0 (Timeout)
20:12:48.775444 socket(PF_FILE, SOCK_STREAM, 0) = 4
20:12:48.775610 fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
20:12:48.775686 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
20:12:48.775841 connect(4, {sa_family=AF_FILE,
path="/var/run/heartbeat/lrm_callback_sock"}, 110) = 0
20:12:48.775907 getsockopt(4, SOL_SOCKET, SO_PEERCRED,
"\214\5\0\0\0\0\0\0\0\0\0\0", [12]) = 0
...
Other commands like "crm configure verify" exhibit the same hang,
although I have not traced them. I guess they must use lrmd too.
You can throw a SIGUSR1 at lrmd to get it to print some debug.
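A sketch of the technique from a shell (the pgrep lookup and log path are my assumptions; lrmd on this box logs via the daemon facility to /var/log/daemon.log):

```shell
# Send SIGUSR1 to a running lrmd to trigger its debug dump, then look at
# the daemon log. Guarded so it degrades gracefully if lrmd is not up.
pid=$(pgrep -x lrmd 2>/dev/null || true)
if [ -n "$pid" ]; then
    kill -USR1 "$pid"                              # triggers lrmd's state dump
    msg="signalled lrmd ($pid); see /var/log/daemon.log"
else
    msg="lrmd not running"
fi
echo "$msg"
```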
This is what was logged just after I executed the first
lrmadmin -C:
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: BEGIN resource dump
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: resource vsftpd/vsftpd/upstart/<null>
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: rsc->op_list...
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->op_list: lrmd_op: operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor] cr
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->op_list: lrmd_op2: rt_tag: 0, interval: 5000, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->op_list: lrmd_op3: t_recv: 210940ms, t_add: 20ms, t_perform: 20ms, t_done: 0ms, t_rcchange: 210940ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: rsc->repeat_op_list...
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: rsc->last_op_done...
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_done: lrmd_op: operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monito
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_done: lrmd_op2: rt_tag: 0, interval: 5000, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_done: lrmd_op3: t_recv: 210940ms, t_add: 5040ms, t_perform: 5040ms, t_done: 5020ms, t_rcchange: 210940ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation monitor[5] on upstart::vsftpd::vsftpd for client 1041, its parameters: crm_feature_set=[3.0.
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 0, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 214510ms, t_add: 214510ms, t_perform: 214510ms, t_done: 214470ms, t_rcchange: 214510ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation stop[8] on upstart::vsftpd::vsftpd for client 1041, its parameters: crm_feature_set=[3.0.1]
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 0, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 213210ms, t_add: 213210ms, t_perform: 213210ms, t_done: 213180ms, t_rcchange: 213210ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monit
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 5000, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 210940ms, t_add: 5040ms, t_perform: 5040ms, t_done: 5020ms, t_rcchange: 210940ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op: operation start[19] on upstart::vsftpd::vsftpd for client 1041, its parameters: crm_feature_set=[3.0.1
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op2: rt_tag: 0, interval: 0, delay: 0
Nov 22 21:12:39 node1 lrmd: [1038]: debug: rsc->last_op_table: lrmd_op3: t_recv: 210990ms, t_add: 210990ms, t_perform: 210990ms, t_done: 210960ms, t_rcchange: 210990ms
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done: END resource dump
Nov 22 21:12:39 node1 lrmd: [1038]: debug: on_op_done:operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor] crm_feature_se
Nov 22 21:12:39 node1 lrmd: [1038]: debug: lrmd_op_destroy: free the operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor]
Nov 22 21:12:39 node1 lrmd: [1038]: debug: replace_last_op: replace last op monitor_5000 for client crmd
Nov 22 21:12:39 node1 lrmd: [1038]: debug: lrmd_op_destroy: free the operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor]
Nov 22 21:12:39 node1 lrmd: [1038]: debug: to_repeatlist: repeat operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor] crm
Nov 22 21:12:39 node1 lrmd: [1038]: debug: lrmd_op_destroy: free the operation monitor[20] on upstart::vsftpd::vsftpd for client 1041, its parameters: CRM_meta_name=[monitor]
Nov 22 21:12:39 node1 lrmd: [1038]: debug: perform_op: no op to perform?
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_main_IPC_Channel_constructor(sock=11,11)
Nov 22 21:12:42 node1 lrmd: [1038]: debug: dumping request: >>>#012lrm_t=reg#012lrm_app=lrmadmin#012lrm_pid=2438#012lrm_gid=0#012lrm_uid=0#012<<<
Nov 22 21:12:42 node1 lrmd: [1038]: debug: on_msg_register:client lrmadmin [2438] registered
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_main_IPC_Channel_constructor(sock=12,12)
Nov 22 21:12:42 node1 lrmd: [1038]: debug: dumping request: >>>#012lrm_t=rclasses#012<<<
Nov 22 21:12:42 node1 lrmd: [1038]: debug: on_msg_get_rsc_classes:client [2438] wants to get rsc classes
Nov 22 21:12:42 node1 lrmd: [1038]: debug: on_receive_cmd: the IPC to client [pid:2438] disconnected.
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_CH_destroy_int(chp=0x25e5830, sock=11) {
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_CH_destroy_int: Calling dnotify(sock=11, arg=0x25ea310) function
Nov 22 21:12:42 node1 lrmd: [1038]: debug: G_main_del_IPC_Channel(sock=11)
lrmd doesn't respond to subsequent signal requests, so I cannot get any further debug using this technique.
A backtrace from gdb shows lrmd to be in a lock wait:
#0 0x00007f7e5f8ba6b4 in __lll_lock_wait () from /lib/libpthread.so.0
#1 0x00007f7e5f8b5849 in _L_lock_953 () from /lib/libpthread.so.0
#2 0x00007f7e5f8b566b in pthread_mutex_lock () from /lib/libpthread.so.0
#3 0x00007f7e601b0806 in g_main_context_find_source_by_id () from /lib/libglib-2.0.so.0
#4 0x00007f7e601b08fe in g_source_remove () from /lib/libglib-2.0.so.0
#5 0x00007f7e61568ba1 in G_main_del_IPC_Channel (chp=0x11deed0) at GSource.c:495
#6 0x00000000004065a1 in on_remove_client (user_data=0x11df8e0) at lrmd.c:1526
#7 0x00007f7e615694ca in G_CH_destroy_int (source=0x11deed0) at GSource.c:675
#8 0x00007f7e601adc11 in ?? () from /lib/libglib-2.0.so.0
#9 0x00007f7e601ae428 in g_main_context_dispatch () from /lib/libglib-2.0.so.0
#10 0x00007f7e601b22a8 in ?? () from /lib/libglib-2.0.so.0
#11 0x00007f7e601b27b5 in g_main_loop_run () from /lib/libglib-2.0.so.0
#12 0x0000000000405d32 in init_start () at lrmd.c:1267
#13 0x0000000000404f7a in main (argc=1, argv=0x7fff91e24478) at lrmd.c:835
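For completeness, the backtrace was captured along these lines (a sketch; the pgrep lookup and gdb invocation are my assumptions, and attaching needs root):

```shell
# Attach gdb non-interactively to the running lrmd, dump every thread's
# stack, and detach. Guarded so it degrades gracefully if lrmd is not up.
pid=$(pgrep -x lrmd 2>/dev/null || true)
if [ -n "$pid" ]; then
    bt=$(gdb -batch -ex 'thread apply all bt' -p "$pid" 2>&1)
else
    bt="lrmd not running"
fi
echo "$bt" | head -n 5
```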
Looking at the source code, it appears that when the lrmadmin client
disconnects from lrmd's cmd socket (having got what it needs), lrmd is left
to tidy up by deleting the client's event source from the GLib GMainContext
loop. It is in the process of calling g_source_remove(), which then hangs
deep inside GLib on a mutex lock.
On the surface the overall sequence makes sense, but the hang doesn't, and
clearly shouldn't happen. I am at a loss as to whether it is a GLib
issue (unlikely, I would have thought?) or an lrmd bug.
IMHO lrmd should NEVER hang!
Thanks
Dave
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/