Reboot of Server stop notifications?


#1

Is there a way, when x number of services are all failing, to either send only one notification for all of them (if the groups are the same), or to have it test whether the host is up and send only the host down message?

I thought I had everything working the way I wanted, then did a reboot and got a lot of email :frowning:


#2

You should either set up host/service dependencies or define parents in your hosts.cfg file.
You see, if Nagios checks a service and it fails, it then checks to see if the parent is down. If the parent fails, it checks its parent and so on, until it finds a parent that is “UP”. So now Nagios knows that the problem is not every single host, but just one host that is blocking all the others from working; the hosts behind it are “unreachable” rather than down. So now, since your contacts.cfg says you only want c, w, r, you don’t get the “unreachable” emails. But since you don’t have it set up like that, how would Nagios know what is important and what is not?
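
As a rough illustration of the parent setup described above, here is a minimal sketch. Everything specific in it is an assumption and not taken from this thread: the host names (switch1, webserver, ftpserver), the addresses, the contact, the generic-host template (assumed to supply the remaining required host directives), the 24x7 timeperiod, and the notification command names, which differ between Nagios versions and sample configs.

```
# Sketch only: all names, addresses, and the template are hypothetical.

# The switch the servers are plugged into. It has no parents entry,
# so it is treated as directly attached to the monitoring host.
define host{
    use         generic-host
    host_name   switch1
    alias       Core switch
    address     192.168.1.2
    }

# Both servers name the switch as their parent. If the switch is down,
# these hosts are reported as UNREACHABLE instead of DOWN.
define host{
    use         generic-host
    host_name   webserver
    alias       httpd server
    address     192.168.1.10
    parents     switch1
    }

define host{
    use         generic-host
    host_name   ftpserver
    alias       ftp server
    address     192.168.1.11
    parents     switch1
    }

# Contact notified for hosts on DOWN (d) and recovery (r) but not
# UNREACHABLE (u), and for services on warning, critical, and recovery.
define contact{
    contact_name                    admin
    alias                           Nagios admin
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,r
    service_notification_options    w,c,r
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email
    email                           admin@example.com
    }
```

With a layout like this, a failed switch should produce one DOWN notification for switch1, while the two servers behind it go UNREACHABLE and stay quiet for this contact.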

I fail to see the case where your httpd server is going to go down at the same time as the other server running ftp. If it does, then surely you want to know this, don’t you? If the problem is not actually the ftp or httpd server but only a network problem, then why haven’t you defined your network as part of Nagios? Nagios is not just for monitoring a bunch of PCs. Those PCs are plugged into the network, so you should be monitoring that also.

If you have set up your Nagios the way I’ve described in dozens of threads in this forum, like the “adding switches” one, then you won’t have the problem you describe.

But to answer your question flat out, NO. Nobody is going to want a tool like Nagios that will only send out one email when 12 hosts are down, if in fact there are 12 problems. But if there is actually only ONE problem, and it is blocking 11 other hosts from working, then the flood of email is due to you not configuring Nagios as I’ve described.


#3

P.S. Why would a reboot show all of your hosts down? Did the checks actually fail? They must have failed, since you got alerts. Does it show this behavior each and every time you reboot?
If so, set the following in nagios.cfg:
retain_state_information=1
state_retention_file=/usr/local/nagios/var/status.sav
use_retained_program_state=1