Too many notifications!

system · November 4, 2005, 10:40pm

I’ve got Nagios 2.0.4b up and running just fine…well, almost. Thing is, upon testing to see if Nagios was indeed monitoring our network correctly, we killed the link to one of our servers and received SIX notification e-mails (7 if you include the one from the pre-existing Nagios server; this second one is being used for redundancy purposes). I can’t think of where to look to change this. Any suggestions?

-Thunder

luca · November 6, 2005, 4:03pm

Check WHAT you are getting notified for… services, dependent hosts or the like.

Luca

hornet_man · November 7, 2005, 5:34pm

I use the “notify only once” option with escalations. Setting notification_interval to 0 will notify only once.

system · November 7, 2005, 8:55pm

[quote=“luca”]Check WHAT you are getting notified for… services, dependent hosts or the like.

Luca[/quote]

I’m getting notified multiple times of the HTTP service check on one of our hosts. On a separate host I have ping set as the only service checked. When it goes down and recovers, I only get one notification.

luca · November 8, 2005, 3:49pm

Check the notifications page to see who is being notified and how often… then try to work down through contact groups and contacts… possibly some wrong settings there?

Luca

system · November 10, 2005, 3:36pm

[quote=“luca”]Check the notifications page to see who is being notified and how often… then try to work down through contact groups and contacts… possibly some wrong settings there?

Luca[/quote]

Nagios is logging each notification as being sent to me. I don’t believe my contact or group configs are off (they’re pretty basic):
#Contacts.cfg
define contact{
contact_name nagios-admin
alias admin
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,r
service_notification_commands notify-by-email
host_notification_commands host-notify-by-email
email [me]@[companyname].com
}

#Contactgroups
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members nagios-admin
}

What I DID notice was that Nagios is logging 3 separate events each time the status of host changes (from OK to unreachable and back again). This is also the case with failed service checks. The particular machine I’m testing uses ping for check-host-alive.

luca · November 10, 2005, 4:13pm

Last thing that comes to mind… is it always the same host, or do you have 3 similar names for the same machine? I know it sounds stupid but… sometimes it happens.

Luca

system · November 10, 2005, 4:18pm

[quote=“luca”]Last thing that comes to mind… is it always the same host, or do you have 3 similar names for the same machine? I know it sounds stupid but… sometimes it happens.

Luca[/quote]

Hehehe
No, it’s the same server, each time.
:idea: While scrolling through other topics in this forum something caught my eye: minimal.cfg, the minimal needed to start nagios right off the bat. It struck me that perhaps I’ve got some redundant notification commands in other files. I won’t pretend to be some Nagios guru, but I thought I’d see what happens when I remove minimal.cfg from my /etc folder. Now I get only TWO notifications, as opposed to three. So perhaps I’m a step in the right direction here.

On one of our larger servers, where we’ve got multiple services being monitored, we received 5 or 6 notifications. I’m going to test the difference now that I’ve removed this minimal config file…

luca · November 10, 2005, 4:54pm

Strange… the nagios check should at least WARN you that you have duplicate definitions…
Let us know

Luca

jakkedup · November 10, 2005, 4:55pm

[quote=“SonOfThunder”]
I’m getting notified multiple times of the HTTP service check on one of our hosts. On a separate host I have ping set as the only service checked. When it goes down and recovers, I only get one notification. [/quote]

When a check fails and recovers, you really do want to be notified for each event, so getting only ONE notification is bad. Change your .cfgs to fix that and include the “r”.

Getting notified multiple times for a service check means I have that check defined more than once, multiple instances of nagios running, or I have notification period set to some incredibly low setting.

It appears you have installed from some RPM since you stated something about the /etc directory, which is not where nagios is installed from source. So that right there is your first mistake, since we don’t have access to the installation instructions for your particular RPM package.

I strongly suggest that if you are going to use nagios for any amount of time, that you ditch the RPM and install from source. That way, you have installation instructions that are 100% correct.

PS: unless you are on the nagios beta testing team or are a nagios expert, you should NOT be installing the beta, since you will run into multitudes of little problems and will be constantly wondering what is wrong.

Your main objective should be to get nagios running, and later, if you get comfortable with it, then try installing the beta.


jakkedup · November 10, 2005, 4:59pm

[quote=“SonOfThunder”]
What I DID notice was that Nagios is logging 3 separate events each time the status of host changes (from OK to unreachable and back again). This is also the case with failed service checks. [/quote]

This is the normal behavior of nagios. If I have my service set up to notify on “d,u,r”, then that is what I’m going to get, and that is what you are getting. So what is the problem?

If you don’t want to be notified of DOWN, then don’t include the “d” or likewise for unreachable, or recovered.

system · November 10, 2005, 5:02pm

[quote=“luca”]Strange… the nagios check should at least WARN you that you have duplicate definitions…
Let us know

Luca[/quote]

The only warning I get when running nagios -v nagios.cfg is that my notification interval is shorter than my check interval, but that’s for a separate server that’s being monitored, so I don’t imagine it’d be causing problems with this one.
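Since the verification run did not flag any duplicates, one way to double-check is to scan the config files for command names that are defined more than once. This is only a minimal Python sketch, assuming the `define command{ ... }` layout of the sample configs. The function name, the second file name `other.cfg`, and the `command_line` values are made up for illustration.

```python
import re
from collections import defaultdict

# Matches an object opener such as "define command{" and captures the type.
DEFINE_RE = re.compile(r"^\s*define\s+(\w+)\s*\{")


def find_duplicate_commands(files):
    """Return {command_name: [files...]} for names defined more than once.

    `files` maps a file name to its text (a hypothetical input shape).
    """
    seen = defaultdict(list)
    for filename, text in files.items():
        current_type = None
        for raw in text.splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and whole-line comments
            opener = DEFINE_RE.match(line)
            if opener:
                current_type = opener.group(1)
            elif line.startswith("}"):
                current_type = None
            elif current_type == "command":
                parts = line.split(None, 1)
                if len(parts) == 2 and parts[0] == "command_name":
                    seen[parts[1].strip()].append(filename)
    return {name: where for name, where in seen.items() if len(where) > 1}


# Illustrative input only: two files that both define check-host-alive.
EXAMPLE_FILES = {
    "minimal.cfg": (
        "define command{\n"
        "    command_name    check-host-alive\n"
        "    command_line    /path/to/check_ping\n"
        "}\n"
    ),
    "other.cfg": (
        "define command{\n"
        "    command_name    check-host-alive\n"
        "    command_line    /path/to/check_ping\n"
        "}\n"
        "define command{\n"
        "    command_name    notify-by-email\n"
        "    command_line    /path/to/mailer\n"
        "}\n"
    ),
}

duplicates = find_duplicate_commands(EXAMPLE_FILES)
```

In practice the dict would be built by reading every file that nagios.cfg pulls in; any name that comes back is worth a closer look.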

system · November 10, 2005, 5:07pm

[quote=“jakkedup”]This is the normal behavior of nagios. If I have my service set up to notify on “d,u,r”, then that is what I’m going to get, and that is what you are getting. So what is the problem?

If you don’t want to be notified of DOWN, then don’t include the “d” or likewise for unreachable, or recovered.
[/quote]

Oh, okay. I thought that was something that would hint as to why I’m getting so many notifications.

system · November 10, 2005, 5:11pm

[quote=“jakkedup”]
[quote=“SonOfThunder”]
I’m getting notified multiple times of the HTTP service check on one of our hosts. On a separate host I have ping set as the only service checked. When it goes down and recovers, I only get one notification. [/quote]

When a check fails and recovers, you really do want to be notified for each event, so getting only ONE notification is bad. Change your .cfgs to fix that and include the “r”.

Getting notified multiple times for a service check means I have that check defined more than once, multiple instances of nagios running, or I have notification period set to some incredibly low setting.

It appears you have installed from some RPM since you stated something about the /etc directory, which is not where nagios is installed from source. So that right there is your first mistake, since we don’t have access to the installation instructions for your particular RPM package.

I strongly suggest that if you are going to use nagios for any amount of time, that you ditch the RPM and install from source. That way, you have installation instructions that are 100% correct.

PS: unless you are on the nagios beta testing team or are a nagios expert, you should NOT be installing the beta, since you will run into multitudes of little problems and will be constantly wondering what is wrong.

Your main objective should be to get nagios running, and later, if you get comfortable with it, then try installing the beta.[/quote]

I messed up my wording there. I meant to say I only got one notification e-mail each time it went down and each time it recovered on that particular server. But that’s neither here nor there; I discovered I was mistaken about that and was actually getting multiple notifications for that host as well.

As far as my method of install: I used one of the mirror sites linked from nagios.org, but not the RPM sites. And the nagios documentation on sourceforge specifically mentions the /etc directory being where my configuration files are kept, and that’s how I have it set up.

jakkedup · November 10, 2005, 6:05pm

nagios.sourceforge.net/docs/2_0/installing.html

mkdir /usr/local/nagios
cd /usr/local/nagios
You should see five different subdirectories.

That’s from the docs: there is NO /etc, but a subdirectory in /usr/local/nagios called etc/

system · November 10, 2005, 6:41pm

[quote=“jakkedup”]http://nagios.sourceforge.net/docs/2_0/installing.html

mkdir /usr/local/nagios
cd /usr/local/nagios
You should see five different subdirectories.

That’s from the docs: there is NO /etc, but a subdirectory in /usr/local/nagios called etc/[/quote]

Oh yeah, that’s where my /etc directory is located. Sorry, I was assuming we were both on the same page about that.

Anywhos, I found out under the Event Log page that, for some reason, when I unplug the network cable to the server, nagios detects the failed plugin after 10 seconds, and then starts 10 ping attempts (I have max retry attempts set to 10). But after ping number 3 or 4, something else is apparently telling nagios to start over again. It’s kinda hard to explain, so I’ll show you:

[11-10-2005 13:23:42] HOST NOTIFICATION: nagios-admin;www-03;DOWN;host-notify-by-email;CRITICAL - Host Unreachable (www-03.[me].com)
[11-10-2005 13:23:42] HOST ALERT: www-03;DOWN;HARD;10;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:39] HOST ALERT: www-03;DOWN;SOFT;9;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:36] HOST ALERT: www-03;DOWN;SOFT;8;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:33] HOST ALERT: www-03;DOWN;SOFT;7;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:30] HOST ALERT: www-03;DOWN;SOFT;6;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:27] SERVICE ALERT: www-03;Ping;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
[11-10-2005 13:23:26] HOST ALERT: www-03;DOWN;SOFT;5;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:26] HOST NOTIFICATION: nagios-admin;www-03;DOWN;host-notify-by-email;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:26] HOST ALERT: www-03;DOWN;HARD;10;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:23] HOST ALERT: www-03;DOWN;SOFT;9;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:23] HOST ALERT: www-03;DOWN;SOFT;4;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:20] HOST ALERT: www-03;DOWN;SOFT;8;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:20] HOST ALERT: www-03;DOWN;SOFT;3;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:17] HOST ALERT: www-03;DOWN;SOFT;2;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:17] HOST ALERT: www-03;DOWN;SOFT;7;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:14] HOST ALERT: www-03;DOWN;SOFT;6;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:14] HOST ALERT: www-03;DOWN;SOFT;1;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:11] HOST ALERT: www-03;DOWN;SOFT;5;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:08] HOST ALERT: www-03;DOWN;SOFT;4;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:05] HOST ALERT: www-03;DOWN;SOFT;3;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:22:57] HOST ALERT: www-03;DOWN;SOFT;2;CRITICAL - Plugin timed out after 10 seconds
[11-10-2005 13:22:47] HOST ALERT: www-03;DOWN;SOFT;1;CRITICAL - Plugin timed out after 10 seconds

So it looks like nagios runs another check-host-alive command sequence of 10 pings half-way through the first, which leads to my getting multiple notifications. Before I had removed minimal.cfg from my /usr/local/nagios/etc/ directory, which contained another command definition for check-host-alive, it would start a third one before the other two had finished and I would get three notifications.
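The interleaving is easier to see by counting how often each attempt number shows up in the HOST ALERT lines. Here is a rough Python sketch; the line format in the regex is inferred from the excerpt above, and the function names are illustrative only.

```python
import re
from collections import Counter

# Inferred layout: [timestamp] HOST ALERT: host;state;state_type;attempt;output
HOST_ALERT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] HOST ALERT: "
    r"(?P<host>[^;]+);(?P<state>[^;]+);(?P<kind>[^;]+);(?P<attempt>\d+);"
)


def attempt_counts(log_text, host):
    """Count how many HOST ALERT lines exist per attempt number for `host`."""
    counts = Counter()
    for raw in log_text.splitlines():
        match = HOST_ALERT_RE.match(raw.strip())
        if match and match.group("host") == host:
            counts[int(match.group("attempt"))] += 1
    return counts


def repeated_attempts(log_text, host):
    """Attempt numbers logged more than once, i.e. overlapping check runs."""
    return sorted(n for n, c in attempt_counts(log_text, host).items() if c > 1)


# A few lines copied from the excerpt above.
SAMPLE_LOG = """
[11-10-2005 13:23:42] HOST ALERT: www-03;DOWN;HARD;10;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:39] HOST ALERT: www-03;DOWN;SOFT;9;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:27] SERVICE ALERT: www-03;Ping;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
[11-10-2005 13:23:26] HOST ALERT: www-03;DOWN;HARD;10;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:23] HOST ALERT: www-03;DOWN;SOFT;9;CRITICAL - Host Unreachable (www-03.me.com)
[11-10-2005 13:23:23] HOST ALERT: www-03;DOWN;SOFT;4;CRITICAL - Host Unreachable (www-03.me.com)
"""

overlapping = repeated_attempts(SAMPLE_LOG, "www-03")
```

Any attempt number that appears more than once within a single outage suggests that two retry sequences were running side by side.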

jakkedup · November 10, 2005, 7:37pm

OK, it’s etc/ not /etc, big difference.
So let’s say I have in my nagios.cfg this:
interval_length=60 #which means 60 seconds per ONE interval.
Then let’s say I have in my services.cfg this:
normal_check_interval=5 #which means 5 x 60 =300 seconds
and also in services.cfg I have this:
max_check_attempts=10
retry_check_interval=1 #which means 1 x 60 = 60 seconds.

If the check fails, it will attempt 10 times to check the service and, according to your output, it takes 10 seconds for the check to time out. According to my setting of retry_check_interval=1, it will retry the check every 60 seconds. So…
Check fails at t=0
Check will repeat 10 times, every 60 seconds.
t=60 and check repeats and fails
t=120 and check repeats and fails
t=180…
t=240…
t=300… ahhh, a magic number, right? normal_check_interval=5, which is 300 seconds, so my normal check scheduling kicks in and it starts another normal ping check.

So you see, I have to ask you, why would you want to check this event 10 times, before you decide that it just isn’t going to ping? That is the problem.
max_check_attempts=3 should be sufficient.
So, with max_check_attempts=3 it would go like this.
t=0 check fails
t=60 fails again on retry #1
t=120 fails again on retry #2
t=180 fails again on retry #3 and since it has reached max_check_attempts it would then go into a hard state and send out the email.
t=240 this service is not checked again yet.
t=300 and bam, the normal_check_interval=5 kicks in, and performs the service check again.
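The two timelines above can be reproduced with a little arithmetic. This Python sketch follows the counting used in this post (retries numbered 1 through max_check_attempts after the first failure) and is not a model of the real Nagios scheduler; the function names are made up, and the docs should be checked for how Nagios itself counts attempts.

```python
def to_seconds(units, interval_length=60):
    """Convert a config value in interval units to seconds."""
    return units * interval_length


def retry_times(max_check_attempts, retry_check_interval, interval_length=60):
    """Seconds after the first failure at which each retry runs (post's counting)."""
    step = to_seconds(retry_check_interval, interval_length)
    return [step * n for n in range(1, max_check_attempts + 1)]


def normal_check_overlaps(max_check_attempts, retry_check_interval,
                          normal_check_interval, interval_length=60):
    """True if the next normal check is due before the retries have finished."""
    next_normal = to_seconds(normal_check_interval, interval_length)
    last_retry = retry_times(max_check_attempts, retry_check_interval,
                             interval_length)[-1]
    return next_normal <= last_retry


# Settings from the post: interval_length=60, normal=5, retry=1.
ten_attempts = normal_check_overlaps(10, 1, 5)   # retries run until t=600
three_attempts = normal_check_overlaps(3, 1, 5)  # retries end at t=180
```

With 10 attempts the normal check at t=300 lands inside the retry window; with 3 attempts the retries are finished well before it.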

system · November 10, 2005, 9:01pm

Hm, I think I understand what you’re saying there…but changing max_check_attempts to 3 didn’t fix the problem. Anywhos, what I was trying to point out was, not that the ping attempted 10 times, but that it checked a total of 20 times. I bolded some of the output to show where a second check-host-alive command was being executed. The non-bolded output showed the first check-host-alive command that hadn’t yet finished before this one started up, and at the end of it all I receive 2 notification e-mails.

Here’s the relevant part of my services.cfg file, if that helps at all:

check_period 24x7
max_check_attempts 3
normal_check_interval 60
retry_check_interval 60
contact_groups admins
notification_interval 7200
notification_period 24x7
notification_options w,u,c,r

P.S. And yes, I did restart nagios after adjusting max_check_attempts.

system · November 10, 2005, 9:16pm

Hm…just found that my boss set the interval_length to 1. Interesting…


jakkedup · November 10, 2005, 9:28pm

interval_length=60 in my nagios.cfg file by default so…

normal_check_interval 60, huh?
So you are going to check each service every 60 x 60 = 3600 seconds?
retry_check_interval 60, huh? So you are going to retry a check after 60 x 60 = 3600 seconds?
If so, then you will retry a check at approximately the same time that you will be performing another normal check.

I see what is going on here. You need to read the documentation at least once. It’s not going to do anyone any good for me to spoon feed you the solutions. You need to THINK about the settings and what is going to happen if you change things from the defaults.
normal_check_interval 5 was default.
retry_check_interval 1 was default
Why you had to change from the defaults, I don’t know, but by doing so, you have created your problems.
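What those numbers mean in wall-clock time depends entirely on interval_length, so it is worth working out both cases. Here is a small Python sketch using the values from the services.cfg excerpt posted earlier; the dictionary and function names are illustrative only.

```python
# Values copied from the services.cfg excerpt in this thread (interval units).
POSTED_SETTINGS = {
    "normal_check_interval": 60,
    "retry_check_interval": 60,
    "notification_interval": 7200,
}


def in_seconds(settings, interval_length):
    """Scale every interval-based setting by interval_length (seconds per unit)."""
    return {name: value * interval_length for name, value in settings.items()}


with_default = in_seconds(POSTED_SETTINGS, 60)  # interval_length=60
with_one = in_seconds(POSTED_SETTINGS, 1)       # interval_length=1
```

With interval_length=60 the check and retry intervals both come out at 3600 seconds; with interval_length=1 the same values mean 60 seconds each and a notification interval of 7200 seconds.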