I noticed last week that when a host goes down with active services, the service status behaves differently depending on the order of events.
Scenario 1:
Active Service A and the host both start OK/Up. The host goes to a hard DOWN state after X retries. Service A is then checked, goes to a critical state, and logs a failed attempt, say 1 of 5. With the host still down, Service A is checked again 5 minutes later (the normal retry interval). It fails again, but the attempt counter does not increment to 2, and the duration of the problem is reset. This repeats indefinitely. Another annoyance is that every one of these failed “soft” attempts is logged.
Scenario 2:
Active Service A and the host both start OK/Up. Service A goes to a hard critical state after X retries. The host check then runs and, after X retries, reaches a hard DOWN state. At this point the service checks keep running at their normal intervals and the duration for Service A increments correctly.
The end result of Scenario 2 is how I would expect Scenario 1 to behave. The failed attempts for the service check should still increment properly even while the host is down, and resetting the duration seems incorrect as well.
This doesn’t necessarily impact the effectiveness of the checks, but it is confusing in the web UI, where both the number of attempts and the duration of the problem are misleading.
Follow-up.
Through more investigation, I’ve found that if the host reaches its hard DOWN state before the second check attempt on Service A, the results are as described in Scenario 1.
If one or more retries on Service A have completed before the host reaches its hard DOWN state, the results are as described in Scenario 2.
So, for example, if you have the max check attempts, normal check interval, and retry interval set identically for both the host and the service, you can very easily get into the situation described in Scenario 1, which, again, is misleading. A potential workaround is to set the host check parameters such that two normal service checks occur in the time that one normal host check occurs. This ensures that two service checks are executed during the host’s soft state. A sketch of both setups follows.
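For illustration, here is a minimal sketch of what I mean, assuming Nagios 3 object syntax; the host name, address, service name, check commands, and interval values are all hypothetical, not taken from my actual setup:

    # Symmetric setup that makes Scenario 1 easy to hit:
    # identical attempts/intervals for the host and the service.
    define host{
            host_name            testhost              ; hypothetical name
            address              192.0.2.1             ; hypothetical address
            check_command        check-host-alive
            max_check_attempts   5
            check_interval       5                     ; minutes
            retry_interval       5
    }

    define service{
            host_name            testhost
            service_description  Service A
            check_command        check_service_a       ; hypothetical command
            max_check_attempts   5
            check_interval       5
            retry_interval       5
    }

    # Proposed workaround: check the host half as often as the service,
    # so two normal service checks can complete within one host check
    # cycle, i.e. while the host is still in a soft state.
    define host{
            host_name            testhost
            address              192.0.2.1
            check_command        check-host-alive
            max_check_attempts   5
            check_interval       10                    ; host checked half as often
            retry_interval       5
    }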
Another follow-up
Upon further review, the exact criterion isn’t whether a retry has completed on the service check; it is whether the host check or the service check completes first.
If the first DOWN host check completes before the first non-OK service check, we get into Scenario 1 and things are reported incorrectly. However, if the opposite occurs, we get the Scenario 2 results and everything seems to work correctly.
So, my workaround above really won’t make any difference. It all comes down to whether the first non-OK service check completes before the host check determines the host is down.
One more time here.
I guess I’ll answer my own question/problem. After running the same tests on Nagios 3.3.1 (previously tested on 3.2.3), it appears this is fixed. No matter in which order the first failed checks complete (host or service first), I get the same behaviour: once the host check determines that the host is down, the active service checks stop altogether and the duration for the service increments properly. It no longer continuously logs a failed soft attempt for the service either. So, I’m off to upgrade my servers to 3.3.1…