My Nagios is working well, and it is great software.
My problem lies in the check settings. Right now I’m getting numerous email and SMS notifications whenever Nagios is unable to reach a host, and by the time it does the second check, the host is already back up. This has caused a lot of emails and SMS messages to be sent.
Is there a way I can make the host check wait at least 2 minutes and confirm the host is really down before it sends out the notifications?
The thing is, a host is marked as down when Nagios can’t ping it, but this only happens for a short period, and within 2 minutes the hosts are up again. What can I do to overcome this? Are there any settings for it?
Any help with this is appreciated, as I’m getting something like 300 SMS a night (host down, host up).
These settings give a notification only after 3 failed checks (one minute apart)…
You can even use service escalations… I get an SMS only after 30 minutes of downtime… I get 3 emails before an SMS is sent, and SMSs are sent only for really critical services. If FTP goes down on a web server I really don’t care at 3 in the morning… if HTTP goes down I do.
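For reference, an escalation like the one described might look roughly like this. It is only a sketch: the host name web01, the service description HTTP, and the contact groups admins-email and admins-sms are made-up placeholders, and it assumes the service itself uses notification_interval 10 with interval_length=60, so the fourth notification goes out about 30 minutes after the first.

```cfg
# Hypothetical object names throughout; adjust to your own setup.
define serviceescalation{
        host_name               web01                    ; placeholder host
        service_description     HTTP                     ; placeholder service
        first_notification      4                        ; notifications 1-3 go to the service's normal (email) contacts
        last_notification       0                        ; 0 = keep escalating until the service recovers
        notification_interval   10                       ; intervals between notifications while escalated
        contact_groups          admins-email,admins-sms  ; placeholder groups; list both if email should continue
        }
```

While an escalation is in effect, notifications go to the contact groups named in the escalation, so the email group is listed again alongside the SMS group.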
As Luca has more or less stated, it’s all down to your config settings. You are probably not setting max_check_attempts, or something similar. Sure, pings will fail, but when they do, surely you want to try again at least once before you wake someone up with a pager.
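Since the alerts in question are host down/up messages, the place to look is max_check_attempts in the host definition. A minimal sketch, assuming the stock check-host-alive command and 24x7 time period from the sample config; the host name, address, and contact group are placeholders:

```cfg
define host{
        host_name               router1          ; placeholder
        alias                   Example router   ; placeholder
        address                 192.0.2.10       ; placeholder (documentation address range)
        check_command           check-host-alive ; assumed to be defined as in the sample config
        max_check_attempts      3                ; host must fail 3 checks in a row before it counts as down
        notification_interval   60
        notification_period     24x7
        notification_options    d,u,r            ; down, unreachable, recovery
        contact_groups          admins           ; placeholder
        }
```

With max_check_attempts set to 1, a single lost ping is enough to trigger a notification, which would match the flood described above.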
Does this mean a notification is sent after 10 failed checks, with 2 minutes between each? Should I change normal_check_interval to 10?
I’m making the changes today, so I hope not to receive a flood of SMS again.
You guys are my saviours!!
P.S. Jake, yes, I do want to try the ping at least 3 times before it actually sends me the SMS about the unreachable host, else I’ll be going back to square one. So where should I look?
normal_check_interval 5 # checks will be performed every 5 intervals (check your nagios.cfg for interval_length=x to determine how long an interval is)
max_check_attempts 10 # if a check fails, it will be checked up to 10 times in total. During this time, no notifications will be sent and the status information will show what output it is getting. The “attempt” column in “service problems” will show 1/10 for the attempt that it is currently on.
retry_check_interval 2 # if a check fails, a retry will be attempted every 2 intervals.
Now, since a retry occurs every 2 intervals, up to 10 times, that works out to be about 20 intervals, but your normal check interval is 5. That’s an overlap and I’ve never tested it. You can test it and make sure that the “attempt” column reaches 10/10 before a notification goes out, either by taking a device down or by setting up a bogus check against a made-up host address.
I would suggest using
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
which should be plenty even for a ping.
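Put into a complete service definition, those values might look like the following sketch. The host name and contact group are placeholders, the check_ping thresholds are simply the ones from the sample config, and the timing comments assume interval_length=60 in nagios.cfg (the default, so one interval is one minute).

```cfg
define service{
        host_name               router1          ; placeholder
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        max_check_attempts      3                ; notify only after the 3rd consecutive failure
        normal_check_interval   5                ; every 5 minutes while the service is OK
        retry_check_interval    1                ; every 1 minute while it is failing
        check_period            24x7
        notification_interval   60
        notification_period     24x7
        notification_options    w,u,c,r          ; warning, unknown, critical, recovery
        contact_groups          admins           ; placeholder
        }
```

With these numbers the first failure shows as attempt 1/3, the retries follow one and two minutes later, and the notification goes out roughly two minutes after the first failed check, provided all three attempts fail. In Nagios 3 and later the same directives are also accepted as check_interval and retry_interval.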
BTW, you should be using check_fping rather than ping. It gives a quicker reply, which is important when you are doing thousands of checks.