Mailing List Archive: omelasticsearch getting stuck after network issues

Hi,
we're using rsyslog to send logs to ES via omelasticsearch.

One issue we're having is that from time to time, after a network issue,
the queue for messages going to ES keeps growing. We are writing to
multiple indices in the same cluster, and this issue only happens for
some of the indices (and not always the same ones).

What I found so far:

Dropping traffic from ES is a relatively reliable way to trigger the issue:

    iptables -A INPUT -p tcp --source <ES proxy IP> --sport 9200 -j DROP
    sleep 180
    iptables -D INPUT -p tcp --source <ES proxy IP> --sport 9200 -j DROP
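
For anyone who wants to repeat this, the steps above could be wrapped in a
small script that also removes the DROP rule if it is interrupted during the
sleep. This is only a sketch: the function names and the ES_PORT, OUTAGE_SECS
and DRY_RUN variables are made up for illustration, and the real commands
need root.

```shell
#!/bin/sh
# Sketch: temporarily drop return traffic from the ES proxy, then restore it.
# ES_PORT, OUTAGE_SECS and DRY_RUN are illustrative names, not rsyslog settings.

ES_PORT="${ES_PORT:-9200}"
OUTAGE_SECS="${OUTAGE_SECS:-180}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Add (-A) or delete (-D) the DROP rule for packets coming from the proxy.
es_drop_rule() {
    run iptables "$1" INPUT -p tcp --source "$2" --sport "$ES_PORT" -j DROP
}

# Simulate an outage of OUTAGE_SECS seconds for the given proxy IP.
simulate_es_outage() {
    proxy_ip="$1"
    es_drop_rule -A "$proxy_ip" || return 1
    # Remove the rule again even if we are interrupted during the sleep.
    trap 'es_drop_rule -D "$proxy_ip"; trap - INT TERM; exit 130' INT TERM
    run sleep "$OUTAGE_SECS"
    trap - INT TERM
    es_drop_rule -D "$proxy_ip"
}
```

After sourcing the file, `DRY_RUN=1; simulate_es_outage 192.0.2.10` only
prints the three commands, while running it as root without DRY_RUN applies
them (192.0.2.10 is a documentation address, substitute the ES proxy IP).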

Force-closing the ES sockets in the worker (via GDB) fixes the issue (the
queue starts draining again):

    for FD in `lsof -p $PID -n -a -i TCP:wap-wsp -a -s TCP:ESTABLISHED -F f | grep '^f' | cut -d f -f 2`; do
        gdb -batch -ex 'set logging on' -ex 'set logging redirect on' \
            -ex "attach $PID" -ex "call shutdown($FD, 1)" \
            -ex 'detach' -ex 'quit'
    done

My current theory is that one or more libcurl requests end up in a state
where rsyslog has sent a request to ES, and ES has tried for some time to
send the response back to rsyslog, then given up and closed the connection.
Since there is no timeout on the request, omelasticsearch stays stuck
forever.

But of course I might be dead wrong and there is a simple explanation.

I wanted to try setting a timeout on the ES request to see if this fixes
the issue, but first I wanted to ask whether there is a specific reason why
there is no option to set a timeout on the ES request, or whether it simply
was not implemented at the time.
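
To illustrate what a request timeout would change, the curl command-line
tool can be used as a stand-in, since its --max-time option limits the whole
transfer in the same way libcurl's CURLOPT_TIMEOUT does. This is not what
omelasticsearch does today, only a sketch for probing ES from the shell while
the DROP rule is active. ES_URL, CONNECT_TIMEOUT, REQUEST_TIMEOUT and the
function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: probe ES with bounded connect and transfer times.
# ES_URL, CONNECT_TIMEOUT and REQUEST_TIMEOUT are illustrative names.

ES_URL="${ES_URL:-http://127.0.0.1:9200}"
CONNECT_TIMEOUT="${CONNECT_TIMEOUT:-5}"
REQUEST_TIMEOUT="${REQUEST_TIMEOUT:-30}"

# Send one request; curl gives up after REQUEST_TIMEOUT seconds in total.
es_probe() {
    curl --silent --show-error --output /dev/null \
        --connect-timeout "$CONNECT_TIMEOUT" \
        --max-time "$REQUEST_TIMEOUT" \
        "$ES_URL/"
}

# Turn a curl exit code into a short message (28 means "operation timed out").
classify_curl_exit() {
    case "$1" in
        0)  echo "request completed" ;;
        28) echo "request timed out" ;;
        *)  echo "request failed (curl exit code $1)" ;;
    esac
}

# Probe once and report the outcome; returns curl's exit code.
es_probe_report() {
    es_probe
    rc=$?
    classify_curl_exit "$rc"
    return "$rc"
}
```

Calling es_probe_report while the DROP rule is in place should report a
timeout after at most REQUEST_TIMEOUT seconds instead of hanging, which is
the behaviour I would hope to get from a timeout option in omelasticsearch.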

Kind regards,
Mattia
_______________________________________________
rsyslog mailing list
https://lists.adiscon.net/mailman/listinfo/rsyslog