My company has Nagios set up and running on a fairly small set of hosts (just under 100), each of which executes approximately 20 service checks every 5 minutes (check_ntp and check_snmp_if checks).
For some reason I can’t understand, we get occasional service check failures throughout the day: sometimes as few as 5, sometimes as many as 15. These failures are always logged in this format (I’ve substituted generic names for the service and the host):
Warning: Check of service ‘service_desc’ on host ‘hostname’ did not exit properly!
This happens on any kind of service check, and there is no pattern to when it happens. The log shows no additional information and no suspicious entries before or after these failures.
In every case where this error happens, the next run of the service check for that host executes properly. So most of the time things work fine; we just can’t explain the occasional failure.
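One way to double-check the "no pattern" impression would be to tally the warnings by hour, host, and service. The following is only a minimal sketch: it assumes each log entry starts with a bracketed epoch timestamp and that names are wrapped in plain or curly single quotes. The function name `tally_failures` is made up for illustration, and hours are counted in UTC.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumed entry layout (an assumption, adjust to the real log):
#   [<epoch>] Warning: Check of service '<svc>' on host '<host>' did not exit properly!
PATTERN = re.compile(
    r"^\[(\d+)\] Warning: Check of service ['‘’](.+?)['‘’] "
    r"on host ['‘’](.+?)['‘’] did not exit properly!"
)

def tally_failures(lines):
    """Count 'did not exit properly' warnings by UTC hour, host, and service."""
    by_hour, by_host, by_service = Counter(), Counter(), Counter()
    for line in lines:
        match = PATTERN.match(line.strip())
        if match is None:
            continue  # not one of the warnings we are looking for
        epoch, service, host = match.groups()
        hour = datetime.fromtimestamp(int(epoch), tz=timezone.utc).hour
        by_hour[hour] += 1
        by_host[host] += 1
        by_service[service] += 1
    return by_hour, by_host, by_service

# Usage idea (the path is a placeholder, not a known location):
#   with open("/path/to/nagios.log") as log:
#       by_hour, by_host, by_service = tally_failures(log)
#       print(by_hour.most_common(), by_host.most_common(5))
```

If the counts cluster on particular hours or hosts, that would at least narrow down where to look; an even spread would back up the impression that the failures are random.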
Does anyone have any advice on troubleshooting this problem? We’re now generating several “false” pages each day, and the system is becoming unreliable for us.