Host Checks and Service Checks


#1

I’m in a position where I’m being forced to explain Nagios behavior in the following scenario. The ICMP Status column reflects the status of the host if I ran a ping from the Nagios server.

Time   ICMP Status   Host Status   Service Status   Notification
1      OK            Up            Up               none
2      OK            Up            Down             Service
3      timeout       Down          Down             Host
*4     OK            Down          Down             none

I’m trying to understand the behavior at Time 4 (the row marked with an asterisk). In a different post (http://www.meulie.net/portal_plugins/forum/forum_viewtopic.php?13781) I saw the following statement:

[blockquote]The status of a host is only checked when a service check results in a non-OK state or if the host was previously down and a service check results in an OK state. [/blockquote]

I am running version 2.9. From the table above, it appears that Nagios did not perform a Host Check even when the Host recovered. And based on the statement above, this is expected behavior. I have two questions about it:

  1. Can someone point me to a spot in the v2.9 documentation which addresses this scenario? I can’t seem to find explicit mention of it.
  2. Can you help me understand the logic behind this behavior? Intuitively, I expected Nagios to notice when a host recovers and send a notification, even if the service has not recovered.

I don’t want to give you the impression I think this is wrong behavior. I just need to understand it better. Thanks.


#2

I think this answers your questions - nagios.sourceforge.net/docs/2_0/ … uling.html
[blockquote]If the first host check returns a non-OK state, Nagios will keep pounding out checks of the host until either (a) the maximum number of host checks (specified by the max_attempts option in the host definition) is reached or (b) a host check results in an OK state.[/blockquote]
So, once max_attempts has been reached, the host checks stop. Why it does this can be seen further on, thus:
[blockquote]Also of note - when Nagios is check the status of a host, it holds off on doing anything else (executing new service checks, processing other service check results, etc). This can slow things down a bit and cause pending service checks to be delayed for a while, but it is necessary to determine the status of the host before Nagios can take any further action on the service(s) that are having problems. [/blockquote]
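The retry behavior described in the first quoted passage can be sketched in a few lines of Python. This is only a toy illustration, not Nagios code: `run_host_checks` and `scripted` are hypothetical names, and `max_attempts` simply follows the wording of the quote. The returned attempt count is also the number of checks during which, per the second quote, everything else would be waiting.

```python
from typing import Callable, List, Tuple


def run_host_checks(check: Callable[[], bool], max_attempts: int) -> Tuple[str, int]:
    """Retry a host check until it succeeds or max_attempts is reached.

    Returns the final state ("UP" or "DOWN") and the number of checks run.
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            # (b) a host check came back OK: stop retrying
            return "UP", attempt
    # (a) the maximum number of attempts was reached: give up, host is down
    return "DOWN", max_attempts


def scripted(results: List[bool]) -> Callable[[], bool]:
    """Build a fake check that replays a fixed list of results."""
    it = iter(results)
    return lambda: next(it)


# A host that never answers, and one that answers on the second try.
dead = run_host_checks(scripted([False, False, False]), max_attempts=3)
flaky = run_host_checks(scripted([False, True, False]), max_attempts=3)
print(dead)   # ('DOWN', 3)
print(flaky)  # ('UP', 2)
```

Either way the loop is bounded, so the time spent blocking other checks is bounded too.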
Therefore, you don’t want Nagios to keep firing off host checks: because it treats them as high priority, they supersede the scheduling of other checks. When this run of host checks ends, though, your service checks keep on trucking. So it is normal to add a service check that replicates the host-alive check; for example, if the host check is a ping, ping the host as a service check as well. Then, when the ping service check recovers, that will cause the host check to fire, and the host will recover too.
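
To illustrate why a ping service helps, here is a toy Python model of the situation at Time 4 in the table from the first post. It assumes, as discussed in this thread, that a host already marked down is only rechecked when one of its services returns OK. `ToyHost` and `host_state_after_recovery` are hypothetical names for illustration; none of this is Nagios source code.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToyHost:
    ping: Callable[[], bool]                  # stand-in for the host check
    services: Dict[str, Callable[[], bool]]   # service name -> service check
    state: str = "UP"

    def run_service_checks(self) -> None:
        for check in self.services.values():
            ok = check()
            # Assumed rule: a DOWN host is only rechecked when a
            # service check comes back OK.
            if self.state == "DOWN" and ok:
                self.state = "UP" if self.ping() else "DOWN"


def host_state_after_recovery(with_ping_service: bool) -> str:
    ping_ok = lambda: True     # ICMP answers again (Time 4)
    app_down = lambda: False   # the application service is still failing
    services = {"app": app_down}
    if with_ping_service:
        services["ping"] = ping_ok
    host = ToyHost(ping=ping_ok, services=services, state="DOWN")
    host.run_service_checks()
    return host.state


print(host_state_after_recovery(with_ping_service=False))  # DOWN
print(host_state_after_recovery(with_ping_service=True))   # UP
```

Without the ping service, nothing ever returns OK, so the host is never rechecked and stays down; with it, the first OK ping result triggers the host check.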

I can’t see anything more in the docs that specifically covers the part of the statement you quoted and marked in red (I can’t even view the original post; I get a 404). In my mind, though, this must be the case: otherwise, once max_attempts has been exceeded, the host check would never fire again and the host would never recover, ever.

That’s how I understand it anyway.

HTH

/S


#3

Thank you for your reply. In case you wanted to look at the thread I quoted, it is named “Nagios stopped to check host” - on the second page I think.

I think I was aware of this, but I’m not sure I fully comprehended the implications of basically stopping everything else whenever Nagios has to perform a host check. With that bit of information, the behavior makes a lot more sense. The doc you pointed me at (I can’t understand how you found that one line) does a good job of explaining what happens when a host goes down. But it has a glaring omission: it does not address what happens when the host recovers.

The idea of creating a ping service also makes a lot of sense. Thanks for that suggestion. It definitely would have mitigated this scenario.