I’m running Nagios v2, just testing on a single host for now. When the host goes down, Nagios doesn’t seem to run through the max_check_attempts in the service config; instead it goes directly to the host max_check_attempts and sends the notification out within a minute. With the service config, the notification should be delayed by at least 5 minutes.
The status of the service shows current attempt at 1/5.
The event log looks like this:
[03-15-2006 20:31:29] SERVICE ALERT: chris;ICMP;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:29] HOST NOTIFICATION: czane;chris;DOWN;test-host-notify;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:29] HOST ALERT: chris;DOWN;HARD;5;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:19] HOST ALERT: chris;DOWN;SOFT;4;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:09] HOST ALERT: chris;DOWN;SOFT;3;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:30:59] HOST ALERT: chris;DOWN;SOFT;2;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:30:49] HOST ALERT: chris;DOWN;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:25:19] INITIAL SERVICE STATE: chris;ICMP;OK;HARD;1;(Service assumed to be ok)
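For reference, here is a minimal sketch of the kind of service definition being described, assuming Nagios 2.x object syntax. The generic-service template, the check command with its thresholds, and the interval values are assumptions for illustration, not the actual config from this setup.

```
# Sketch only: template, command and intervals are assumed values.
define service{
        use                     generic-service                 ; hypothetical template
        host_name               chris
        service_description     ICMP
        check_command           check_ping!100.0,20%!500.0,60%  ; assumed command and thresholds
        max_check_attempts      5
        normal_check_interval   5       ; in units of interval_length (60 seconds)
        retry_check_interval    1       ; retries one minute apart
        }
```

With values like these, a failing service on a host that stays up would take roughly four minutes of retries to reach a HARD state. In the log above the service is already HARD on attempt 1, which is consistent with the host check having found the host DOWN.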
Okay. I changed the service check to check-ifoperstatus. Since this switch is down, it will fail.
Here’s the event log. It seems to be the same problem:
[03-16-2006 05:00:37] SERVICE ALERT: chris;check_ifoperstatus;CRITICAL;HARD;1;CRITICAL: SNMP error with snmp version 1 (No response from remote host ‘172.18.132.153’)
[03-16-2006 05:00:37] HOST NOTIFICATION: czane;chris;DOWN;test-host-notify;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 05:00:37] HOST ALERT: chris;DOWN;HARD;5;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 05:00:27] HOST ALERT: chris;DOWN;SOFT;4;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 05:00:17] HOST ALERT: chris;DOWN;SOFT;3;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 05:00:07] HOST ALERT: chris;DOWN;SOFT;2;CRITICAL - Plugin timed out after 10 seconds
March 16, 2006 04:00
[03-16-2006 04:59:57] HOST ALERT: chris;DOWN;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 04:54:27] INITIAL SERVICE STATE: chris;check_ifoperstatus;OK;HARD;1;(Service assumed to be ok)
137:# This value works of the interval_length you specify later. If you leave
141:# actual seconds rather than a multiple of the interval_length variable.
553:interval_length=60
It’s odd that the event log shows the service alert occurring after the host alert.
I suspect that active checks are enabled for hosts. Not a good thing. Why bother with a host check? Host checks are performed only when a service check has failed after its max retries.
Correction to the above: active checks should be enabled in hosts.cfg, but they should NOT be scheduled to run on a regular basis via the check_interval setting. Don’t define check_interval at all for your hosts, or they will end up severely impacting your Nagios setup/performance.
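As a rough illustration of that advice, a host definition along these lines keeps active checks enabled but leaves check_interval out, so the host is only checked on demand. This is a sketch assuming Nagios 2.x object syntax; the generic-host template and the check-host-alive command are assumptions, and the address is the one that appears in the SNMP error in the log.

```
# Sketch only: no check_interval is defined, so host checks are not scheduled regularly.
define host{
        use                     generic-host        ; hypothetical template
        host_name               chris
        alias                   chris
        address                 172.18.132.153
        check_command           check-host-alive    ; assumed command name
        max_check_attempts      5
        active_checks_enabled   1
        }
```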
Yeah, that’s what is happening: as soon as the service ping check fails, it tries the host check, which of course fails, hence the instant down notice.
I still need host checks for hosts that have services (ifOperStatus, other SNMP polls), so I can’t disable active host checks across the board.
What I ended up doing is creating an escalation for the host: the initial notification goes to a nobody email address (i.e. /dev/null) and is then escalated to the real notification contact group after a set duration. A kludge, but it works for what I need.
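The escalation workaround might look roughly like the following, assuming Nagios 2.x object syntax. The contact group names nobody-group and real-admins, the generic-host template, and the interval values are hypothetical, and only the directives relevant to the workaround are shown in the host definition.

```
# Sketch only: contact group names and intervals are hypothetical.
define host{
        use                     generic-host    ; hypothetical template
        host_name               chris
        contact_groups          nobody-group    ; first notification goes to the throwaway address
        notification_interval   5               ; time until the second notification, in interval_length units
        }

define hostescalation{
        host_name               chris
        contact_groups          real-admins     ; the real notification contact group
        first_notification      2               ; take over from the second notification onward
        last_notification       0               ; 0 = keep escalating until the host recovers
        notification_interval   30
        }
```

The delay before the real contacts hear about the outage is then set by the host's notification_interval, since that controls when the second notification goes out.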
When I have more time I’ll see how I can improve my template setup to enable host checks only for hosts that have services attached and disable them for those that I’m just pinging.
Yes, you are right, and I am going to edit my reply, so I don’t mess someone else up.
What I should have said was that you might have hosts being scheduled for checking due to your check_interval setting in hosts.cfg. That would not be a good thing and can severely impact your performance.