My company has Nagios set up and running on a rather small set of hosts (just under 100), each of which executes approximately 20 service checks over a 5-minute interval (check_ntp and check_snmp_if checks).
For some reason I can’t understand, we have occasional service check failures throughout the day: sometimes as few as 5 a day, sometimes as many as 15. These failures are always formatted in this way (I’ve substituted generic names for the service and the host):
Warning: Check of service ‘service_desc’ on host ‘hostname’ did not exit properly!
This happens on any kind of service check, and there is no pattern to the time it happens. The log does not show any additional information, or suspicious entries before or after these failures.
In all of the cases in which this error happens, the next time the service check is run for that host it executes properly. So, most of the time things work fine. We just can’t explain the occasional failure.
Does anyone have any advice in troubleshooting this problem? It’s getting so that we’re generating several “false” pages each day, and the system is getting unreliable for us.
Put a higher timeout on the checks, or set Nagios not to notify on the first failed check; you should have it set to notify only after three failures. That way you can catch the problem when it happens: keep an eye on the web interface and try the check by hand when it goes to a soft critical (before notifications are sent).
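To illustrate the "notify after three failures" idea, here is a minimal Python sketch of soft versus hard states. It is only a model of the behaviour, not Nagios code: the function name `should_notify` is hypothetical, the `max_check_attempts` parameter is named after the Nagios service directive, and recovery notifications and re-notification intervals are deliberately ignored.

```python
def should_notify(results, max_check_attempts=3):
    """Model soft/hard states for one service.

    `results` is a sequence of booleans (True = check OK).
    Returns one boolean per result: would a problem notification go out?
    Assumption: a problem is only "hard" after max_check_attempts
    consecutive failures, and only the first hard failure notifies.
    """
    consecutive_failures = 0
    already_notified = False
    decisions = []
    for ok in results:
        if ok:
            # Any OK result resets the counter and clears the problem.
            consecutive_failures = 0
            already_notified = False
            decisions.append(False)
            continue
        consecutive_failures += 1
        is_hard = consecutive_failures >= max_check_attempts
        send = is_hard and not already_notified
        if send:
            already_notified = True
        decisions.append(send)
    return decisions
```

With three attempts, a single stray failure followed by a good check never pages anyone, which is the case described in the original post.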
[quote=“luca”]Put a higher timeout on the checks, or set Nagios not to notify on the first failed check; you should have it set to notify only after three failures. That way you can catch the problem when it happens: keep an eye on the web interface and try the check by hand when it goes to a soft critical (before notifications are sent).
Luca[/quote]
The concern I have with changing the timeout is twofold: 1) this is happening with several native Nagios plugins, so changing the timeout would involve modifying the plugin source and recompiling, and 2) if the plugins were indeed timing out, Nagios should report the plugin’s own message saying that it had timed out.
Is anyone familiar with the source code in base/checks.c, enough to understand the exact conditions that trigger the “did not exit properly” response? We’ve been having a bit of trouble nailing it down.
It turns out our local modifications to the service_message structure pushed it over 512 bytes, making its write() over the pipe in write_svc_message non-atomic. read_svc_message was then reading only a partial message, and the zeroed-out fields triggered the “did not exit properly!” errors.
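For anyone hitting something similar, here is a rough Python sketch of the failure mode. It is not the Nagios code or the patch: the record layout `MESSAGE_FORMAT` and the helpers `write_is_atomic` and `read_full` are made up for illustration. The one firm fact it relies on is that POSIX only guarantees atomic pipe writes up to PIPE_BUF bytes, and that PIPE_BUF may be as small as 512.

```python
import os
import select
import struct

# Hypothetical fixed-size record standing in for service_message;
# the real struct layout is not shown in this thread.
MESSAGE_FORMAT = "=i64s600s"  # return code, host name, plugin output
MESSAGE_SIZE = struct.calcsize(MESSAGE_FORMAT)

POSIX_MIN_PIPE_BUF = 512  # smallest PIPE_BUF value POSIX permits


def write_is_atomic(size, pipe_buf=select.PIPE_BUF):
    """A pipe write is only guaranteed atomic if it fits in PIPE_BUF."""
    return size <= pipe_buf


def read_full(fd, size):
    """Read exactly `size` bytes, looping over short reads."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:
            raise EOFError("pipe closed before a full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def demo_partial_read():
    """Simulate a split write and show what a single read() returns."""
    read_fd, write_fd = os.pipe()
    try:
        message = struct.pack(MESSAGE_FORMAT, 2, b"hostname", b"CRITICAL")
        os.write(write_fd, message[:300])            # first fragment only
        fragment = os.read(read_fd, MESSAGE_SIZE)    # single read: short
        os.write(write_fd, message[300:])            # rest arrives later
        rest = read_full(read_fd, MESSAGE_SIZE - len(fragment))
        return len(fragment), fragment + rest == message
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

A reader that does one read() and assumes it got a whole record sees the short fragment, which corresponds to the partially filled structure described above. Keeping the record at or below PIPE_BUF, or looping until the full size has been read, are the two general ways to avoid that.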
I posted (via gmane) to the nagios-devel list with the details and a patch.