Service check retry not being performed?


#1

I’m running Nagios v2, just testing on a single host for now. When the host goes down, Nagios doesn’t seem to run through the max_check_attempts in the service config. Instead it goes directly to the host max_check_attempts and sends the notification out within a minute. With the service config, it should delay the notification by at least 5 minutes.

The status of the service shows current attempt at 1/5.

The event log looks like this:

[03-15-2006 20:31:29] SERVICE ALERT: chris;ICMP;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:29] HOST NOTIFICATION: czane;chris;DOWN;test-host-notify;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:29] HOST ALERT: chris;DOWN;HARD;5;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:19] HOST ALERT: chris;DOWN;SOFT;4;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:09] HOST ALERT: chris;DOWN;SOFT;3;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:30:59] HOST ALERT: chris;DOWN;SOFT;2;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:30:49] HOST ALERT: chris;DOWN;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:25:19] INITIAL SERVICE STATE: chris;ICMP;OK;HARD;1;(Service assumed to be ok)
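The spacing of the HOST ALERT entries above can be checked with a short script. This is a minimal Python sketch and not part of Nagios; the helper names `alert_times` and `gaps_seconds` are made up for illustration, and the log lines are copied from the excerpt above.

```python
from datetime import datetime

# Lines copied from the event log excerpt in this post (newest first).
LOG = """\
[03-15-2006 20:31:29] SERVICE ALERT: chris;ICMP;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:29] HOST NOTIFICATION: czane;chris;DOWN;test-host-notify;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:29] HOST ALERT: chris;DOWN;HARD;5;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:19] HOST ALERT: chris;DOWN;SOFT;4;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:31:09] HOST ALERT: chris;DOWN;SOFT;3;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:30:59] HOST ALERT: chris;DOWN;SOFT;2;CRITICAL - Plugin timed out after 10 seconds
[03-15-2006 20:30:49] HOST ALERT: chris;DOWN;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
"""

STAMP_FORMAT = "%m-%d-%Y %H:%M:%S"


def alert_times(log, prefix="HOST ALERT"):
    """Return the sorted timestamps of log entries whose type starts with prefix."""
    times = []
    for line in log.strip().splitlines():
        stamp, sep, rest = line.partition("] ")
        if sep and rest.startswith(prefix):
            times.append(datetime.strptime(stamp.lstrip("["), STAMP_FORMAT))
    return sorted(times)


def gaps_seconds(times):
    """Return the gap in whole seconds between each pair of consecutive timestamps."""
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_seconds(alert_times(LOG)))
```

Every gap equals the 10-second plugin timeout, which suggests the five host check attempts ran back to back rather than on the one-minute retry schedule of the service.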

host configuration:
define host {
    host_name chris
    alias chris
    address 172.18.132.153
    hostgroups ManoaNet
    check_command check-host-alive
    max_check_attempts 5
    check_interval 0
    process_perf_data 0
    retain_nonstatus_information 1
    contact_groups netops_group
    notification_period 24x7
    notification_interval 0
    notification_options d,r
}

service config:

define service {
    host_name *
    service_description ICMP
    check_command check_ping!500,40%!1000,100%
    max_check_attempts 5
    normal_check_interval 5
    retry_check_interval 1
    check_period 24x7
    contact_groups netops_group
    notification_options c,r
    notification_period 24x7
    notification_interval 0
    notifications_enabled 1
}

I must be overlooking something really simple…

thanks!
chris


#2

Use a different service, like check_http, to test. An ICMP ping is the same as the check-host-alive host check…

Luca


#3

Okay. I changed the service check to check-ifoperstatus. Since this switch is down, it will fail.

Here’s the event log. It seems to be the same problem:

[03-16-2006 05:00:37] SERVICE ALERT: chris;check_ifoperstatus;CRITICAL;HARD;1;CRITICAL: SNMP error with snmp version 1 (No response from remote host ‘172.18.132.153’)
[03-16-2006 05:00:37] HOST NOTIFICATION: czane;chris;DOWN;test-host-notify;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 05:00:37] HOST ALERT: chris;DOWN;HARD;5;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 05:00:27] HOST ALERT: chris;DOWN;SOFT;4;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 05:00:17] HOST ALERT: chris;DOWN;SOFT;3;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 05:00:07] HOST ALERT: chris;DOWN;SOFT;2;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 04:59:57] HOST ALERT: chris;DOWN;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
[03-16-2006 04:54:27] INITIAL SERVICE STATE: chris;check_ifoperstatus;OK;HARD;1;(Service assumed to be ok)


#4

Please display the output of this command:

cat /usr/local/nagios/etc/nagios.cfg | grep -n "interval_length"

Seems like your interval length is set to “10” instead of the default, “60”. But that’s just a guess until you check it out for yourself.


#5

I double-checked that… it’s set to 60.

cat /usr/local/nagios/etc/nagios.cfg | grep -n "interval_length"

137:# This value works of the interval_length you specify later. If you leave
141:# actual seconds rather than a multiple of the interval_length variable.
553:interval_length=60

It’s odd that the event log shows the service alert occurring after the host alert.
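With interval_length=60 confirmed, the delay the service settings would normally produce is simple arithmetic. Here is a rough Python sketch, assuming retries are evenly spaced and ignoring plugin run time and scheduler latency; `seconds_until_hard` is a hypothetical helper, not a Nagios function.

```python
def seconds_until_hard(max_check_attempts, retry_check_interval, interval_length=60):
    """Seconds from the first failed check to the final (hard) attempt.

    The first failure counts as attempt 1, so only max_check_attempts - 1
    retries are scheduled, each retry_check_interval time units apart.
    """
    retries = max_check_attempts - 1
    return retries * retry_check_interval * interval_length


if __name__ == "__main__":
    # Values taken from the ICMP service definition and nagios.cfg in this thread.
    print(seconds_until_hard(5, 1, 60))  # 240
```

That works out to about four minutes of soft retries, but only while the host itself is still considered up. In both logs the service is recorded as HARD on attempt 1 at the same second the host goes DOWN, so the service retries never get a chance to run.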


#6

I suspect that active checks are enabled for hosts. Not a good thing. Why bother with a host check? Host checks are performed only when a service check has failed after its max retries.

Correction to the above: active checks should be enabled in hosts.cfg, but they should NOT be scheduled to run on a regular basis through the check_interval setting. Don’t define check_interval at all for your hosts, or the scheduled host checks will end up severely impacting your Nagios setup/performance.


#7

Yeah, that’s what is happening: as soon as the service ping check fails, it tries the host check, which of course fails, and thus the instant down notice.

I still need host checks for hosts that have services (ifOperStatus, other SNMP polls), so I can’t disable active host checks across the board.

What I ended up doing is creating an escalation for the host: the initial notification goes to a nobody email address (i.e. /dev/null) and is then escalated to the real notification contact group after a set duration. A kludge, but it works for what I need.
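The routing idea behind this workaround can be modelled in a few lines. This is only a toy Python sketch, not Nagios code and not the actual config, which isn’t shown here. It assumes the usual escalation rule that an escalation applies to notification number n when first_notification <= n and last_notification is either 0 (no upper limit) or >= n. The `nobody_group` name, the `Escalation` class and the `recipients` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    first_notification: int
    last_notification: int  # 0 means "no upper limit"
    contact_group: str

    def applies_to(self, n):
        """True if this escalation covers notification number n (1-based)."""
        if n < self.first_notification:
            return False
        return self.last_notification == 0 or n <= self.last_notification


def recipients(n, escalations, default_group):
    """Contact groups for notification n; escalations replace the default group."""
    matched = [e.contact_group for e in escalations if e.applies_to(n)]
    return matched or [default_group]


if __name__ == "__main__":
    # Assumed layout: the host's own contact group is a throwaway address,
    # and the real group takes over from the second notification onward.
    escalations = [Escalation(2, 0, "netops_group")]
    for n in (1, 2, 3):
        print(n, recipients(n, escalations, "nobody_group"))
```

In this model the first notification lands in the throwaway group and every later one reaches netops_group, so the "set duration" is whatever renotification interval separates notification 1 from notification 2.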

When I have more time I’ll see how I can improve my template setup to enable host checks only for hosts that have services attached and disable them for those that I’m just pinging.

thanks for the help!


#8

Yes, you are right, and I am going to edit my reply, so I don’t mess someone else up.
What I should have said was that you might have hosts being scheduled for checking due to your check_interval setting in hosts.cfg. That would not be a good thing and can severely impact your performance.