Only 1 alert per host if >1 service out + delaying notify

Ok, I have two problems here…

  1. If a host goes down, we get something like 24+ notifications (SMS messages), which got so annoying that our techs disabled SMS notifications altogether. Now when something goes down they don’t get notified at all unless they have their monitoring screen up… We’ve had 2 outages in the past 3 days that took them longer than 10 hours to respond to because it was the middle of the night and nobody was paged. The reason for so many notifications is that we have virtual servers running on each host, and each virtual server has a half dozen or so services running on it. Is there some way to say “All of these services belong to this host, and if more than X of these services are down at any given time, then just send a single notification for the whole group”?

  2. The other problem we are having is that our backup software sometimes takes the virtual hosts offline for “JUST” long enough for Nagios to kick in and send out a notice that the virtual machine is down, and then almost immediately sends a notification saying it’s back up again. Is there any way to say that these specific hosts that are taking too long to run their backups are allowed to be offline for X minutes longer than normal?

P.S. I’m sure you can tell, I am NOT a Nagios master… :wink: Sorry if the questions are stupid or common, I tried searching in the forums and couldn’t find any related posts.

It never fails… So I ask the question and almost immediately afterwards (I think) I found the answer…

Is it as simple as editing all of the cfg files for the virtual hosts and putting a "parents" directive in each host config?

If a host goes down, you shouldn’t be paged for its services being down.
If multiple vhosts depend on a physical host, use parents and don’t send notifications on the UNREACHABLE status.
For the backups, either use timeperiods or let the host have an extra check or two before notifying.
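
To make that concrete, here is a rough sketch of what the host definitions might look like. Everything named here is made up for illustration (the host names, the addresses, and the `generic-host` template, which is assumed to supply the other required directives), so adjust it to your own setup rather than treating it as the one right way to do it.

```
# Hypothetical names and addresses throughout; generic-host is an assumed template.
define host {
    use                     generic-host
    host_name               phys-host-01
    address                 10.0.0.10
}

define host {
    use                     generic-host
    host_name               vm-web-01
    address                 10.0.0.21
    parents                 phys-host-01    ; if the parent is down, this VM becomes UNREACHABLE instead of DOWN
    notification_options    d,r             ; leave out "u" so UNREACHABLE does not page anyone
    max_check_attempts      6               ; more failed checks required before the state goes HARD
    retry_interval          1               ; re-check every 1 time unit while in a SOFT problem state
}
```

With something like this, a short backup window only causes a page if the VM is still down after all the retries, and an outage of the physical host should produce one DOWN notification for the parent rather than one per VM.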

Hope some of this makes sense and helps :slight_smile:

Hmm… Doesn’t seem to be working the way I had thought it would… I have

max_check_attempts = 6
interval_length=60
normal_check_interval = 5

But I still get notified immediately if a host is down. I am running Nagios 3.0.6.
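
For comparison, a sketch of how those three settings are normally written, assuming the first and third are meant for a host object definition and `interval_length` is set in the main nagios.cfg. The host name and template are placeholders, and `check_interval` is, as far as I know, the Nagios 3 name for `normal_check_interval`.

```
# nagios.cfg (main config file) uses key=value syntax
interval_length=60

# Object config files separate directive and value with whitespace, no "="
define host {
    use                     generic-host    ; hypothetical template
    host_name               vm-web-01       ; hypothetical host
    max_check_attempts      6
    check_interval          5
}
```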

Stop Nagios, check for surviving processes, kill any that are left, then start Nagios and check again.