I’ve got Nagios running on two different servers, both monitoring exactly the same hosts/services. Over the holiday weekend we had some trouble with our network and received no notifications. As a result, the outage lasted at least three hours, which leads to unhappy customers, loss of money, etc…you know the drill. The first box might have detected the outage, but it couldn’t have sent any notifications: the ISP and mail server it uses are really flaky at times, so I had disabled notifications for all hosts/services on it (otherwise we get notifications about problems every couple of minutes, which loads up my inbox and that of about a dozen other people). The second Nagios box, I discovered, hadn’t actually been running since the 13th of December. I (stupidly) assumed that because everything looked okay on the web interface, everything was working properly. In actuality, I believe the cause was some power outages we’ve been having, and I’d forgotten to configure my system to start Nagios after every reboot…
However, if the first Nagios box had been monitoring the second Nagios service, we’d have known about it sooner. Thing is, I’m only running a ping check on the server my second Nagios service is on, so while the server itself was up, Nagios was not, and we didn’t know about it. What I’m looking to do is use a plugin to monitor the actual service, make sure Nagios is running, and send a notification when it isn’t. Any ideas for using a pre-existing plugin to do this, or would I need to write my own plugin?
If you wanted to write your own plugin, I believe that the nagios process id is stored in <nagios_root>/var/nagios.lock. The status.log file in the same directory can be useful as well.
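A rough sketch of what such a home-grown plugin might look like, in Python. It assumes the lock file holds nothing but the daemon's PID, that it runs on a POSIX system, and that the path is handed in by the caller; `check_lock_file` is a made-up name, not part of Nagios. The exit codes are the standard plugin ones (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import os

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_lock_file(lock_path):
    """Return (exit_code, message) for the PID stored in lock_path.

    Assumption: the lock file contains only the PID of the Nagios daemon.
    """
    try:
        with open(lock_path) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError) as exc:
        return UNKNOWN, f"NAGIOS UNKNOWN - cannot read PID from {lock_path}: {exc}"

    if pid <= 0:
        return UNKNOWN, f"NAGIOS UNKNOWN - implausible PID {pid} in {lock_path}"

    try:
        # Signal 0 delivers nothing; it only tests whether the PID exists.
        os.kill(pid, 0)
    except ProcessLookupError:
        return CRITICAL, f"NAGIOS CRITICAL - no process with PID {pid}"
    except PermissionError:
        # The process exists but belongs to another user; that still counts as alive.
        pass
    return OK, f"NAGIOS OK - process {pid} is running"
```

To use it as a plugin you would print the message and exit with the returned code, passing in wherever your nagios.lock actually lives. Note that a stale lock file whose PID has been reused by some other process would still report OK, so this is a weaker test than checking the process name.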
i’m having the nagios machine monitor itself…
a check in crontab (running every 5 minutes) looks at the time of the last executed check on a particular service. if it’s more than 15 or 30 minutes back i assume nagios is hanging and launch a stop, kill, start script.
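For anyone wanting to try the same idea, here is a minimal sketch of the staleness test in Python. It is not the actual script described above: how you obtain the last check time (as epoch seconds) depends on your Nagios version and is left to the caller, and `is_stale`, `watchdog`, and the restart command are all hypothetical names.

```python
import subprocess
import time

# The post uses 15 or 30 minutes; 15 is picked here as an assumption.
MAX_AGE_SECONDS = 15 * 60


def is_stale(last_check, now=None, max_age=MAX_AGE_SECONDS):
    """True when the last executed check is older than max_age seconds."""
    if now is None:
        now = time.time()
    return (now - last_check) > max_age


def watchdog(last_check, restart_cmd, now=None,
             max_age=MAX_AGE_SECONDS, runner=subprocess.call):
    """Run restart_cmd when checks look stale; return True if a restart was triggered.

    restart_cmd is whatever stop/kill/start script you use (hypothetical here).
    runner is injectable so the logic can be tried without restarting anything.
    """
    if is_stale(last_check, now=now, max_age=max_age):
        runner(restart_cmd)
        return True
    return False
```

Run from cron every 5 minutes, `watchdog(last_check, ["/path/to/your/restart-script"])` would do nothing while checks are fresh and fire the restart once they fall more than 15 minutes behind.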
Thanks for the help, guys. Upon a bit of further investigation (which I should’ve done before posting here), I found I can use the check_procs plugin with the -a option, so it will search for the argument string that shows the Nagios daemon is running. Now I just gotta get NRPE or NSCA running and I’ll be all set (I’m having an issue with that because of my iptables rules, and I don’t know yet how to edit them).
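To illustrate the idea behind that check (this is not check_procs itself, just a rough Python equivalent of "go critical when no process has this string in its arguments"): the sketch below assumes a POSIX `ps`, and both the function names and the search string in the usage note are hypothetical.

```python
import subprocess

# Standard Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


def list_command_lines():
    """Return the full command line of every process, as reported by POSIX ps."""
    out = subprocess.run(
        ["ps", "-e", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def check_daemon(command_lines, needle, minimum=1):
    """Return (exit_code, message); CRITICAL when fewer than `minimum` lines contain needle."""
    matches = [cmd for cmd in command_lines if needle in cmd]
    if len(matches) < minimum:
        return CRITICAL, f"PROCS CRITICAL: {len(matches)} processes with args '{needle}'"
    return OK, f"PROCS OK: {len(matches)} processes with args '{needle}'"
```

Something like `check_daemon(list_command_lines(), "nagios -d")` would then report CRITICAL whenever the daemon is missing; pick a search string that matches how Nagios is actually started on your box.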
It’s not quite as nice a solution as Luca’s, though…that’s actually pretty cool, as long as we could work in some way for Nagios to still send out notifications, say, if Nagios fails to restart.
[Edited Wed Dec 28 2005, 04:09PM]
"we had some trouble with our network and received no notifications"
Yes, that can be a problem. But if you could have logged into the Nagios machine and looked at your Nagios web interface, you should have been able to tell what was broken on your network. Oh, you aren’t monitoring your network equipment, you say? You say you are monitoring hosts and services though, right? Those hosts and services aren’t worth a dime if they have no net to run on. So I’d suggest that you spend some quality time tracing cables/switches/routers/interface cards and make your “status map” look IDENTICAL to the way your network is wired up PHYSICALLY.
Now that you have done that, your Nagios machine will show that HOST “router port e52” is DOWN and all HOSTS after that will show as “Unreachable”. Since I don’t allow any unreachable alerts, I get only ONE alert, and that is “DOWN”. That is how you get rid of that Nagios spam in your “INBOX”. The only reason you get so many alerts is that you have no dependency setups or parent/child relationships.
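A simplified Python model of that parent/child logic may help show why only one alert comes out. It is only a sketch of the idea, not Nagios's real implementation: a host that stops answering is DOWN if it has no parents or at least one parent is still UP, and UNREACHABLE otherwise. Apart from “router port e52”, the host names are invented for the example.

```python
def classify(parents, responding):
    """Map each host to UP, DOWN, or UNREACHABLE.

    parents: dict of host -> list of parent hosts (assumed to contain no cycles).
    responding: set of hosts that currently answer their host check.
    """
    states = {}

    def state_of(host):
        if host not in states:
            if host in responding:
                states[host] = "UP"
            elif not parents.get(host) or any(
                state_of(p) == "UP" for p in parents[host]
            ):
                # Nothing upstream is to blame, so this host itself is the problem.
                states[host] = "DOWN"
            else:
                # Every path to the host is already broken further up.
                states[host] = "UNREACHABLE"
        return states[host]

    for host in parents:
        state_of(host)
    return states


def alerts(states):
    """Hosts worth notifying about when unreachable alerts are switched off."""
    return sorted(host for host, state in states.items() if state == "DOWN")


# Hypothetical topology: nagios-box -> core-switch -> router port e52 -> two servers.
topology = {
    "nagios-box": [],
    "core-switch": ["nagios-box"],
    "router port e52": ["core-switch"],
    "web-server": ["router port e52"],
    "mail-server": ["router port e52"],
}
print(alerts(classify(topology, responding={"nagios-box", "core-switch"})))
```

With the router port dead, both servers behind it come out as UNREACHABLE and the only host left in the alert list is the router port itself.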
Bottom line is this. Nagios isn’t worth a dime, if you are not watching those interface cards, switch port that nagios is plugged into, and the rest of the network on the path to the ACTUAL service that you are very very interested in.
For example, the only thing I might want nagios for is to watch the status of httpd on one of our servers. If the net goes down, then I’m going to see that the service check failed and the host failed. But if I have the network mapped out, and nagios is checking every connection from my nagios box, all the way to the web server, then if the net goes down, I’ll see that the httpd box is in state “unreachable” and that switch port 16 interface status is “DOWN”. Since port 16 is the port that the web server is plugged into, I look and find that the cable is unplugged. Net is back up in less than 10 minutes.
The above example is not far-fetched. That is why I installed Nagios in the first place. Our network went down, and we spent 45 minutes pinging this and pinging that before we finally found a likely problem, unplugged and replugged a cable in a switch, and the problem was solved (poor connection in the switch port or dirty contacts).
So now, I can fix the same problem in 10 minutes tops.