
snmp_portscan.nes lockups
Hello,

Ever since 2.0.7, I've seen some snmp_portscan lockups on other people's
networks. I've never had this happen on my own network (despite vigorous
attempts to reproduce the problem), but I can't imagine what situation would
cause the module to lock up like that. Here are the details surrounding this
[potential] issue...

- snmpwalk isn't running when I look at the ps output - but there will
always be at least a couple of nessusd processes whose status says they are
running snmp_portscan.nes

- The timeout seems to have been activated after 60 seconds (according to
nessusd.messages), which - from what I can tell - is the proper timeout
(it's hardcoded into the module). However, while that MIGHT be killing
snmpwalk (although most likely not, keep reading), it's not killing the
snmp_portscan.nes plugin.

- snmpwalk is installed in /usr/bin. There are definitely no other copies
of snmpwalk on the system.

- kill -9-ing the nessusd process that shows the snmp_portscan.nes message
allows the scan to continue normally.

- The snmpwalk versions involved have been 5.0.1 (a Red Hat build whose
version ID makes the plugin use the old-style commandline args, even though
the binary actually takes the new-style commandline arguments, so the plugin
creates an invalid commandline), 5.1.0 and 5.1.1 (the last two being Fedora
RPMs). I should point out again that I've never had snmpwalk or
snmp_portscan hang on any scan I'VE kicked off for any reason - it always
seems to time out properly.

- The hangs have occurred on the FreeBSD 4.8, FreeBSD 4.9 and Linux 2.6
kernels, each time during a single scan with max_checks set to 5 and
max_hosts set to 20, for a total of 100 nessusd scanning processes (the
relevant nessusd.conf lines are shown just after this list). Memory
definitely didn't appear to be an issue - at least not when I checked. The
systems in question that have had this problem all have at least 1GB of RAM,
so running out of memory seems unlikely, but I guess it's possible.

- Lockups do NOT happen each time - they SEEM random - they can happen
during times of heavy activity, as well as times of no activity.

- The systems where the lockups are occurring have no development tools
installed, so I can't run gdb, strace, truss, etc. to figure out exactly
what nessusd or snmp_portscan.nes is hanging on.
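
For reference, the relevant settings in the nessusd.conf used on these
servers were along these lines (paraphrased from memory - everything else
was left at the defaults):

    # nessusd.conf excerpt from the affected servers
    max_hosts = 20
    max_checks = 5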

That being said, I looked through the source code, and I noticed something
that SEEMS wrong - I can't say for sure if it is actually a bug, since I
haven't had time to work out if nessusd is doing some things that I'm not
noticing ...

The wrong-seeming thing is that only the first nessus_popen (the one used
by the version detection routine) saves the process ID in snmpwalk_process.
That means that when the timeout occurs and SIGTERM is sent to
snmp_portscan.nes, it will - at best - kill the version detection process.
From there, it appears that the plugin just keeps running.
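
To illustrate, here's a rough paraphrase of the pattern I'm describing -
this is NOT the actual plugin source; the nessus_popen() prototype is my
recollection and may differ, and names like walk_args and scan_sketch are
just placeholders:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* my recollection of the nessus-libraries prototype - the real
       signature may differ, but the last argument is the pid holder */
    extern FILE *nessus_popen(const char *cmd, char *const argv[], pid_t *pid);

    static pid_t snmpwalk_process;

    /* SIGTERM handler: it can only kill whatever pid happens to be
       stored in snmpwalk_process */
    static void sigterm_handler(int sig)
    {
        if (snmpwalk_process > 0)
            kill(snmpwalk_process, SIGTERM);
    }

    static void scan_sketch(char *const version_args[], char *const walk_args[])
    {
        FILE *fp;

        signal(SIGTERM, sigterm_handler);

        /* version detection: the ONLY call that saves its pid */
        fp = nessus_popen("/usr/bin/snmpwalk", version_args, &snmpwalk_process);
        /* ... read version string, close, etc. ... */

        /* the real walks: the pid argument is NULL, so the handler
           has no way to reach these children */
        fp = nessus_popen("/usr/bin/snmpwalk", walk_args, NULL);
        /* ... */
    }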

To verify my theory, I created a binary called snmpwalk that returns a
valid snmpwalk version string when called with -V, but sleeps for 30
minutes in all other cases. As I'd predicted, SIGTERM was delivered, but
nothing was killed, because snmp_portscan.nes was trying to kill the pid
from the version scan.
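
The stand-in snmpwalk was along these lines (the exact version string, and
where it gets printed, only matter to the extent that the plugin's version
check accepts it - this is just the idea):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        /* answer the plugin's version check */
        if (argc > 1 && strcmp(argv[1], "-V") == 0) {
            printf("NET-SNMP version: 5.1.1\n");
            return 0;
        }

        /* for any real walk, just hang around for 30 minutes */
        sleep(30 * 60);
        return 0;
    }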

Next, I started saving the pid for each nessus_popen call (in the current
version of the plugins that I have, 2.0.10a, this argument is normally
NULL). I changed the NULL to &snmpwalk_process, and sure enough, the
FIRST snmpwalk process is killed, but then snmp_portscan runs the next
snmpwalk process, and so on.
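
In other words, the change boiled down to this (paraphrased, not a literal
diff of the plugin):

    /* as shipped (2.0.10a, as far as I can tell): the walk's pid
       is thrown away */
    fp = nessus_popen("/usr/bin/snmpwalk", walk_args, NULL);

    /* my change: store the pid so sigterm_handler can at least kill
       the snmpwalk that happens to be running when the timeout fires */
    fp = nessus_popen("/usr/bin/snmpwalk", walk_args, &snmpwalk_process);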

So, I added some additional code at the end of the sigterm_handler code to
kill -9 getpid(), and THAT seemed to do what I was looking for.
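
Concretely, the handler in my locally patched copy ended up looking
something like this (again a sketch using the names from the earlier
fragment, not the stock plugin code):

    static void sigterm_handler(int sig)
    {
        /* kill whichever snmpwalk is currently running, if any */
        if (snmpwalk_process > 0)
            kill(snmpwalk_process, SIGTERM);

        /* then take the plugin process itself down so it cannot go on
           to launch the next snmpwalk - the "kill -9 getpid()" above */
        kill(getpid(), SIGKILL);
    }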

Obviously, my version of snmpwalk isn't a perfect test - I've never had
snmpwalk lock up on me, or fail to time out, regardless of the network
conditions (routes down, dead host, packet filtering, etc.) - even when the
nessusd server's load is well over 100 and the local network is nearly
saturated... therefore I am ASSUMING the problem isn't related to snmpwalk
being frozen, but is more likely some sort of blocking / deadlock related to
some code in the plugin.

So after that lengthy explanation, I have the following questions:

- Is snmp_portscan missing functionality, or am I misunderstanding how the
timeout process works? That is, should it really ONLY be trying to kill a
pid that has most likely already exited? I would think a timeout routine
would want to exit the module as soon as the signal is caught - but none of
the other plugins seem to do this either, making me wonder whether all
plugins are at the mercy of the same problem.

- Is anyone else having these hangs? Does anyone know how to reproduce
this? When researching this problem, I found a reference to this in the
archives, but it was regarding 2.0.7 or earlier, IIRC. I would think this
would cause problems for more people than me.

- On another note, is it possible that the version detection routines are
incorrect? It SEEMS like the commandline arguments changed at version 5,
not > 5.0.5 like the version detection routines say. However, knowing
Red Hat, they might've back-ported some 5.0.5+ functionality into the 5.0.1
version, then left the version number alone.

- On yet another note, is it a good idea to hardcode a timeout value? Yes,
it's a good idea to make sure the timeout is never too low (60 seconds is
usually enough time) - BUT if someone has changed the snmpwalk timeout to
something like 30 seconds, the plugin will time out before it has kicked off
all of the snmpwalk processes (again, assuming I understand what the timeout
is doing).

I would really appreciate any help anyone could give me - there are a few
blocking functions that could theoretically cause the deadlocks these people
are having - assuming that there's some sort of race condition /
timing-related issue. At the same time, I don't see how the existing
timeout code could possibly remedy a lockup / deadlock within the
snmp_portscan plugin. I should also say that, other than this lockup, nessus
v2.0.10a has been extremely stable and reliable.

Let me know if I can do anything to help resolve this issue - I can send you
the nessus.conf config file that was used for the scan. I can also try to
get any other log file from the problematic nessus server - just ask and I'd
be more than happy to try to get at it.

Thank you for your time,

Brian Costello
btx@calyx.net