Ok, I have two problems here…
- If a host goes down, we get something like 24+ notifications (SMS messages). That got so annoying that our techs disabled SMS notifications altogether, so now when something goes down they don’t get notified at all unless they have their monitoring screen up… We’ve had 2 outages in the past 3 days that took them longer than 10 hours to respond to, because it was the middle of the night and nobody was paged. The reason for so many notifications is that we have virtual servers running on each host, and each virtual server has a half dozen or so services running on it. Is there some way to say “All of these services belong to this host, and if more than X of these services are down at any given time, then just send a single notification for the whole group”?
- The other problem is that our backup software sometimes takes the virtual hosts offline for “JUST” long enough for Nagios to kick in and send out a notice that the virtual machine is down, followed almost immediately by a notification saying it’s back up again. Is there any way to say that these specific hosts, which take too long to run their backups, are allowed to be offline for X minutes longer than normal?
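To make the two rules I’m after a bit more concrete, here is a rough Python sketch of the behaviour I have in mind. This is NOT Nagios configuration and doesn’t use any Nagios API; the function names, host names, and numbers are all made up for illustration.

```python
def group_notifications(host, service_states, threshold):
    """Return the messages to send for one host.

    service_states maps a service name to True (up) or False (down).
    threshold is the "X" from the first question (hypothetical value).
    """
    down = sorted(name for name, up in service_states.items() if not up)
    if len(down) > threshold:
        # More than X services are down: one message for the whole group.
        return [f"{host}: {len(down)} of {len(service_states)} services down"]
    # Otherwise one message per failed service, which is what we get today.
    return [f"{host}/{name} is down" for name in down]


def should_notify_down(host, seconds_down, default_grace, extra_grace):
    """Decide whether a host has been down long enough to page someone.

    extra_grace maps specific hosts (e.g. the slow-backup ones) to the
    extra seconds they are allowed to stay offline.
    """
    allowed = default_grace + extra_grace.get(host, 0)
    return seconds_down > allowed
```

So with a threshold of 2, a virtual server with three failed services would produce one SMS instead of three, and a host with 300 extra seconds of grace would not page anyone for a 200-second backup outage.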
P.S. As I’m sure you can tell, I am NOT a Nagios master… Sorry if the questions are stupid or common; I tried searching the forums and couldn’t find any related posts.