Notification after 2nd time service is down


#1

Hi,

Is there a way to configure Nagios so that it only sends out an email notification the second time a service is found down? I have it configured to try three times (max_check_attempts) when a service goes down, but this does not do what I want.

We have some unreliable links that occasionally go down for a few minutes but generally come back up within 5 minutes, so I don’t want to be notified unless a link is still down after two five-minute check cycles.

Thanks in advance.


#2

Hey there,

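I had the same problem. Your best bet is to set max_check_attempts to 3 and retry_interval to 2 in your service definition. Nagios will then not put the service into a hard CRITICAL state (and alert) until all three attempts have failed. The first failed check counts as attempt 1, so that is two retries 2 minutes apart, roughly 4 minutes after the first failed check. Modify as necessary and raise the numbers if you want it to fail more times before it goes critical.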

Alternatively, you could set up a service escalation for those services (i.e., the first alert goes to a null address and the second alert goes to you). Have a look at the Nagios docs for how escalations work.
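
A rough sketch of the escalation approach follows. It is untested, and the host, service, and contact group names are placeholders. In particular, null-admins is a hypothetical contact group whose contact discards the mail, and it would need its own contact and contactgroup definitions. The sketch also assumes the rest of the service settings come from a template.

```
# Service: notification #1 goes to the hypothetical null-admins group.
define service{
        use                     generic-service
        host_name               xx-flakey
        service_description     Check Flakey Website
        notification_interval   5               ; must be non-zero, or no second notification is sent
        contact_groups          null-admins     ; placeholder group that discards alerts
        check_command           check_http!www.flakey.com
        }

# Escalation: from notification #2 onward, alert the real admins instead.
define serviceescalation{
        host_name               xx-flakey
        service_description     Check Flakey Website
        first_notification      2               ; escalation starts at the second notification
        last_notification       0               ; 0 = stays in effect until the service recovers
        notification_interval   5
        contact_groups          critical-admins
        }
```

While the escalation is in effect, only the contact groups listed in it are notified, so the real admins see nothing until the second notification.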


#3

Hey MP,

Thanks for that. I’ve been trying unsuccessfully to do just that. Would you be able to post some snippets of your configuration?


#4

It’s as easy as including those directives in your service definitions for the flaky links:

define service{
        use                     generic-service
        host_name               xx-flakey
        service_description     Check Flakey Website
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    2
        contact_groups          critical-admins
        check_command           check_http!www.flakey.com
        }

normal_check_interval is 5, so under normal conditions the service is checked every 5 minutes. If a check comes back critical, Nagios disregards the normal check interval and rechecks after the retry_check_interval instead. If it is critical again, Nagios rechecks after another retry_check_interval. Once max_check_attempts has been reached (3 in this case), Nagios puts the service into a HARD critical state and sends out an alert. Just increase the retry_check_interval or max_check_attempts to raise the time it takes Nagios to make that change.
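
For the original question (alert only if the link is still down after two five-minute checks), the same mechanism might be used with different numbers. The following is an untested variation on the definition above. It assumes the first failed check counts as attempt 1 and that interval_length is left at its default of 60 seconds, so the intervals are in minutes.

```
define service{
        use                     generic-service
        host_name               xx-flakey
        service_description     Check Flakey Website
        check_period            24x7
        max_check_attempts      2       ; attempt 1 = first failure, attempt 2 = recheck
        normal_check_interval   5       ; check every 5 minutes while OK
        retry_check_interval    5       ; recheck 5 minutes after a failure
        contact_groups          critical-admins
        check_command           check_http!www.flakey.com
        }

# Expected timeline, in minutes after the first failed check:
#   0   check fails  -> SOFT critical, attempt 1/2, no alert
#   5   check fails  -> HARD critical, attempt 2/2, alert sent
#       (if the check at minute 5 passes instead, the service
#        returns to OK and no problem alert is sent)
```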

As long as max_check_attempts and retry_check_interval (called retry_interval in newer Nagios versions) are set in your service definition, they will override whatever the template sets.
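
To illustrate the override, here is a minimal sketch of a template next to a service that uses it. The template values are made up for the example and are not the real generic-service defaults. Other directives a real template would set, such as check_period and contact_groups, are left out for brevity.

```
# Template with hypothetical defaults.
define service{
        name                    generic-service
        max_check_attempts      5
        retry_check_interval    1
        register                0       ; template only, not a real service
        }

# Service that inherits from the template but sets its own values.
define service{
        use                     generic-service
        host_name               xx-flakey
        service_description     Check Flakey Website
        max_check_attempts      3       ; overrides the template's 5
        retry_check_interval    2       ; overrides the template's 1
        check_command           check_http!www.flakey.com
        }
```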


#5

Thanks MP!

Now it all makes sense.

Thanks for the help.