I am familiar with the following text from the Nagios documentation:
“Also of note - when Nagios is check the status of a host, it holds off on doing anything else (executing new service checks, processing other service check results, etc). This can slow things down a bit and cause pending service checks to be delayed for a while, but it is necessary to determine the status of the host before Nagios can take any further action on the service(s) that are having problems.”
However, I interpreted that to mean that Nagios would only delay future service checks on that particular host while it determined whether the host was up or down. In reality, I've found that it delays all checks on all hosts. I'm wondering whether this is fully intentional, or whether there is a workaround.
As an example, we have a moderately sized network with approximately 300 hosts being monitored by Nagios. Now, if some accident makes 100 of these hosts suddenly unreachable, I have observed that Nagios will check a service on one of those hosts, find that the service is down, check the host itself, and determine that the host is down, all before moving on to any other service. This means that if it takes, for example, 20 seconds to confirm that one host is down, it might be about half an hour (100 × 20 seconds ≈ 33 minutes) before Nagios has sent notifications for all 100 hosts. This seems excessive.
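To put numbers on that estimate, here is a quick back-of-the-envelope sketch in Python. It assumes host checks run strictly one at a time and that each one takes a fixed 20 seconds. The function name is only for illustration and is not anything from Nagios itself.

```python
def serial_notification_delay(down_hosts: int, seconds_per_host_check: float) -> float:
    """Worst-case seconds until the last down host is confirmed,
    assuming host checks are executed one after another."""
    return down_hosts * seconds_per_host_check


# Figures from the example above: 100 unreachable hosts, 20 s per host check.
delay = serial_notification_delay(100, 20)
print(f"{delay:.0f} s = {delay / 60:.1f} min")  # prints "2000 s = 33.3 min"
```

Real timings would vary with check timeouts and retry settings, so treat this as a rough upper bound rather than a measurement.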
Is there any way for Nagios to be a bit more clever about this, so that it understands that when one host is down it doesn't need to pause every other check on every other host while it checks the status of the first host? Not only does this delay important outage information, it also delays the service checks we use to collect performance data. The performance data we collect and log gets thrown off and is basically unavailable during an outage, even for services that are not affected by the outage at all.
I know about dependencies as a way to prevent certain cases of this happening, but there are always times when things go wrong in an unpredictable way. It seems that Nagios behaves most poorly in exactly the cases where quick responses are needed most.