Alert based on number of failed services


#1

I’m not sure if I’m posting this in the correct forum. This might be something to consider for future releases, if it’s not already implemented. Is there a way to notify based on a user-defined number of services being in a user-defined state? For example, say a host has 50 different services associated with it. If one of those services goes into a critical state, there is no notification. But if it hits a threshold of, say, 5 services in a critical state, a notification goes out. It would be even nicer to adjust this based on timeperiods: during the day, notify us if 1 service is in a critical state, but at night, make it 5 services.

Here is my reasoning behind this request. I work for a small, privately owned cable company. Nagios works well for about 90% of our monitoring. We have one server that monitors all our channels and sends an SNMP trap to Nagios when it detects a problem with a channel. I have a service set up in Nagios for EVERY channel we monitor. I don’t want to be notified in the middle of the night if 1 channel goes out, but if we lose a bunch of channels, that is a problem and I do want to be notified. Hopefully someone has some ideas! It would also be nice to have a summary email sent in the morning listing all problems from the previous night that didn’t meet the criteria for notification but were not in an OK state.

Thanks.


#2

I don’t think there is such an option in Nagios. What you could do is write a script that checks all services on that host (you can use the Nagios check commands for easier handling of check output) and returns a non-OK state only if 5 or more checks are in a critical state.
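A rough sketch of the threshold logic such a script might use, in Python. It relies only on the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name `aggregate` and the shape of the `states` dictionary are made up for illustration, and how the per-service states are collected is left out, since that depends on your setup.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def aggregate(states, threshold=5):
    """Summarize per-service states into a single plugin result.

    `states` is a hypothetical mapping of service name -> return code.
    Returns (exit_code, message); the result is CRITICAL only when at
    least `threshold` services are in a critical state.
    """
    critical = sorted(name for name, code in states.items() if code == CRITICAL)
    if len(critical) >= threshold:
        return CRITICAL, "CRITICAL - %d services critical: %s" % (
            len(critical), ", ".join(critical))
    return OK, "OK - %d services critical (threshold %d)" % (
        len(critical), threshold)
```

A wrapper script would print the message and exit with the returned code (for example via `sys.exit`), which is how Nagios reads the result of a check.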

When you have the script done, use it as the check command of a service on that host and set that service’s notification_period to a night timeperiod. For all the other services, which check each item separately, set the notification_period to a daytime timeperiod. Or, if you don’t want to overload the server with so many checks, you can define check_period for those services instead: the service that checks for 5 or more critical states runs at night, and all other services are checked during the day.


#3

The problem is that there is no way to actively check the services on the host; it only uses passive checks via SNMP. You did give me an idea, however: I could write an intermediate DB or program that stores the last state of each passive check and have Nagios poll that actively. However, we have decided not to implement this for now.
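For anyone who wants to try the intermediate-store idea, here is a minimal sketch in Python using the standard-library `sqlite3` module. The table name `last_state` and the functions `open_store`, `record_state`, and `check_threshold` are all hypothetical; it assumes whatever handles the SNMP traps can call `record_state` for each channel, and that Nagios runs a small active check built around `check_threshold`.

```python
import sqlite3

OK, CRITICAL = 0, 2  # standard Nagios plugin return codes


def open_store(path=":memory:"):
    """Open (or create) the store holding the last known state per service."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS last_state ("
        "service TEXT PRIMARY KEY, state INTEGER NOT NULL)"
    )
    return conn


def record_state(conn, service, state):
    """Called by the (assumed) trap handler: overwrite the service's last state."""
    conn.execute(
        "INSERT OR REPLACE INTO last_state (service, state) VALUES (?, ?)",
        (service, state),
    )
    conn.commit()


def check_threshold(conn, threshold=5):
    """Active check: CRITICAL only if `threshold` or more services are critical."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM last_state WHERE state = ?", (CRITICAL,)
    ).fetchone()
    if count >= threshold:
        return CRITICAL, "CRITICAL - %d channels in critical state" % count
    return OK, "OK - %d channels in critical state" % count
```

Because each service has a single row that is overwritten on every update, a channel that recovers stops counting toward the threshold as soon as its OK state is recorded.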