I noticed last week that when a host goes down with active services, the service status behaves differently depending on the order of events.
Scenario 1: Active Service A and its host start OK/Up. The host goes down and reaches a hard DOWN state after X retries. Service A is then checked, goes to a critical state, and logs a failed attempt, say 1 of 5. The host is still down, and five minutes later (the normal retry interval) Service A is checked again. It fails again, but the attempt counter does not increment to 2, and the duration of the problem is reset. This repeats indefinitely for the service check. Another annoyance is that each of these failed "soft" attempts is logged every time.
Scenario 2: Active Service A and its host start OK/Up. Service A reaches a hard critical state after X retries. The host check then runs and, after X retries, reaches a hard DOWN state. From that point, the service checks keep running at their normal intervals, and the duration for Service A increments correctly.
The end result of Scenario 2 is how I would expect Scenario 1 to behave. The service's failed attempts should increment normally even while the host is down, and resetting the problem duration also seems incorrect.
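To make the expected behavior concrete, here is a minimal sketch of how I'd expect the attempt counter to work. This is a hypothetical Python model written for illustration, not the monitoring tool's actual implementation; the `ServiceState` fields and the `record_failed_check` helper are invented names:

```python
from dataclasses import dataclass

@dataclass
class ServiceState:
    """Hypothetical model of a service's check state (illustration only)."""
    max_attempts: int = 5
    current_attempt: int = 0
    state_type: str = "HARD"  # HARD or SOFT
    status: str = "OK"

def record_failed_check(svc: ServiceState, host_down: bool) -> None:
    """Expected behavior: increment the attempt counter on every failed
    check, regardless of whether the host is DOWN (Scenario 2's behavior)."""
    svc.status = "CRITICAL"
    if svc.state_type == "HARD" and svc.current_attempt >= svc.max_attempts:
        return  # already a hard problem; duration keeps accruing, no reset
    svc.current_attempt += 1
    if svc.current_attempt >= svc.max_attempts:
        svc.state_type = "HARD"  # hard state reached; stop logging soft retries
    else:
        svc.state_type = "SOFT"

svc = ServiceState()
for _ in range(7):  # seven consecutive failed checks while the host is down
    record_failed_check(svc, host_down=True)
print(svc.current_attempt, svc.state_type)  # → 5 HARD
```

Under this model the counter climbs 1 through 5 and then latches into a hard state, instead of resetting to 1 on every retry as Scenario 1 does.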
This doesn't necessarily affect the effectiveness of the checks, but the incorrect attempt count and problem duration are confusing in the web UI.