I hope this question hasn’t been asked before, but I searched the Nagios documentation and didn’t find an answer.
I just want to know one thing:
If a host goes DOWN or UNREACHABLE (HARD state), does Nagios still check every service on that host, or does it skip the service checks to improve performance?
If it skips the service checks, that’s cool, I have nothing to change.
But if not, is there a way to prevent Nagios from checking the services of a host that is down or unreachable?
Bad news for you:
Nagios keeps on checking the services and doesn’t even check the host anymore … (which can be a problem, btw!)
Here is a scenario:
- a service goes down on a host
=> Nagios tests the host
=> the host is found DOWN => Nagios notifies HOST DOWN
=> Nagios keeps on testing the services, and notifying HOST DOWN
=> one service goes UP => the host is assumed to be UP => the HOST UP notification is sent
There is some logic to it, especially since (apparently, I don’t know why) host checks are more resource-consuming than service checks (host checks are not scheduled, so they have to be inserted into the plan).
Hope this is clear enough.
Btw: my problem is that, even if the host comes back UP (the machine is restarted) but no service is UP (the monitored process or NRPE was not restarted), Nagios won’t detect that the host is up and will keep sending the HOST DOWN alert … which is really stupid and confusing for the team that reads the alerts.
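To make the scenario above easier to follow, here is a toy model of it in Python. It only encodes the behaviour as described in this post (host checked on demand when a service fails, host assumed UP as soon as one service answers); it is not taken from the Nagios source, and the function name and arguments are made up for the illustration.

```python
def next_host_state(host_state, host_reachable, prev_services, services):
    """One check cycle of the behaviour described above (toy model only).

    host_state      -- what the monitor currently believes: "UP" or "DOWN"
    host_reachable  -- whether the machine would really answer a host check
    prev_services   -- service results from the previous cycle (True = OK)
    services        -- service results from this cycle (True = OK)
    """
    if host_state == "DOWN":
        # No host check is run here; one OK service is taken as proof of life.
        return "UP" if any(services) else "DOWN"
    # Host believed UP: a service that just failed triggers an on-demand host check.
    newly_failed = any(was and not now for was, now in zip(prev_services, services))
    if newly_failed and not host_reachable:
        return "DOWN"
    return "UP"


# Walk through the scenario: crash, reboot without services, one service back.
state = "UP"
steps = [
    (False, [True, True], [False, False]),   # machine crashes
    (True, [False, False], [False, False]),  # machine rebooted, services still dead
    (True, [False, False], [True, False]),   # one service restarted
]
for reachable, before, after in steps:
    state = next_host_state(state, reachable, before, after)
    print(state)  # prints DOWN, DOWN, UP
```

The second step is the confusing one: the machine is reachable again, but in this model nothing re-runs the host check, so the state stays DOWN until a service recovers.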
I think I’m going to “hack” Nagios to prevent service checks when the host state is not UP. I think that’s the only solution.
But as for your problem, I think it’s because you left the default value of 60 minutes for the check interval in the host configuration (or active checks aren’t enabled).
Like you said, Nagios executes a host check when it first starts, on every state change in a related service, and at regular intervals of 60 minutes (by default, and if active checks aren’t disabled). So if you change the default to, say, 5 minutes, then even if your services don’t come back UP, Nagios will still check the host’s reachability :D, so your host can change state to UP.
How about using an event handler in the host object to fire the DISABLE_HOST_SVC_CHECKS external command when the host goes down (and, conversely, ENABLE_HOST_SVC_CHECKS when it comes back up)?
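A minimal sketch of what such an event handler could look like in Python, in case it helps. Assumptions: the handler is called with $HOSTSTATE$, $HOSTSTATETYPE$ and $HOSTNAME$ as its three arguments, and the external command file lives at /usr/local/nagios/var/rw/nagios.cmd (the real path is whatever command_file is set to in nagios.cfg). The function names are made up for the example.

```python
import sys
import time

# Assumed path; check the command_file setting in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_command(host_state, state_type, host_name, now):
    """Return the external command line to submit, or None to do nothing."""
    if state_type != "HARD":
        return None  # ignore SOFT states so retries don't toggle the checks
    if host_state in ("DOWN", "UNREACHABLE"):
        name = "DISABLE_HOST_SVC_CHECKS"
    elif host_state == "UP":
        name = "ENABLE_HOST_SVC_CHECKS"
    else:
        return None
    # External command format: [unix_timestamp] COMMAND;argument
    return "[{}] {};{}\n".format(int(now), name, host_name)


def submit(line, path=COMMAND_FILE):
    """Write one command line to the external command file."""
    with open(path, "w") as cmd_file:
        cmd_file.write(line)


# Expected call: handler.py <HOSTSTATE> <HOSTSTATETYPE> <HOSTNAME>
if __name__ == "__main__" and len(sys.argv) == 4:
    line = build_command(sys.argv[1], sys.argv[2], sys.argv[3], time.time())
    if line is not None:
        submit(line)
```

One thing to keep in mind: once the service checks are disabled, a recovering service can no longer bring the host back UP, so the host presumably needs its own regular active check (a short check interval) for the ENABLE side to ever fire.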
I just hope the script and the external command are safe and won’t misbehave when the host recovers to the UP state. I can’t leave a host’s services unchecked on a customer network, or I’m a dead man! lol
No worries. Always worth trying these potentially career limiting work-arounds in a test environment first if possible, but hey, I’m sure it’ll work nearly as well as you hope.