We currently have only one instance of Nagios running. We are looking to set up a second node in a different location for redundancy.
We are thinking of installing Nagios on it and then using rsync and a few cron jobs to update the slave server from our currently running server, so that we only have to add new hosts on a single box.
Has anyone done this before, and how does it work? Also, it would be nice to set it up so that only our main server sends out notifications while it’s online; otherwise we’ll get twice the notifications for the same failed services, which is a no-no. The secondary server could send out the alerts if the primary fails several uptime checks run from the secondary.
Hopefully I didn’t confuse anyone. If any of you guys/gals know of a plugin or a nice way of setting this up, please do speak up!
Here’s what you need to know:
“Set up a cron job on the slave host that periodically (say every minute) runs a script that checks the status of the Nagios process on the master host (using the check_nrpe plugin on the slave host and the nrpe daemon and check_nagios plugin on the master host). The script should check the return code of the check_nrpe plugin. If it returns a non-OK state, the script should send the appropriate commands to the external command file to enable both notifications and active service checks. If the plugin returns an OK state, the script should send commands to the external command file to disable both notifications and active checks.”
You’re going to want to look at the “DISABLE_NOTIFICATIONS” and “STOP_EXECUTING_SVC_CHECKS” external commands and their ENABLE and START counterparts. Your cron job on the slave will pass those commands to the nagios.cmd file (the external command file).
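For reference, each external command is a single line of the form "[timestamp] COMMAND" written to the command file. Here is a minimal Python sketch of that. It assumes the default source-install path /usr/local/nagios/var/rw/nagios.cmd (check command_file in your nagios.cfg, yours may differ), and format_command and send_command are just names I made up for illustration:

```python
import time

# Assumed path; use the command_file value from your nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_command(name, now=None):
    """Build one external command line: [unix_timestamp] COMMAND_NAME."""
    if now is None:
        now = int(time.time())
    return "[%d] %s\n" % (now, name)


def send_command(name, command_file=COMMAND_FILE):
    """Write one command to the Nagios external command file.

    The command file is a named pipe, so opening it blocks until the
    local Nagios process is running and reading from it.
    """
    with open(command_file, "w") as pipe:
        pipe.write(format_command(name))
```

Calling send_command("DISABLE_NOTIFICATIONS") on the slave would then switch notifications off there. The script has to run as a user with write access to the command file.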
So here’s a quick rundown of what you need to do:
-Set up Nagios on the slave server, with rsync copying the plugins directory and the config directory over from the master.
-Set up NRPE on the master Nagios box with a command that runs the check_nagios plugin.
-Set up a cron job on your slave which runs “check_nrpe -H master.server.net -c check_nagios” and then checks the return code. Write an if/then statement which passes DISABLE_NOTIFICATIONS and STOP_EXECUTING_SVC_CHECKS to the nagios.cmd file if the check returns OK, or ENABLE_NOTIFICATIONS and START_EXECUTING_SVC_CHECKS if the master is not OK.
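The cron script from the last step could look roughly like this in Python. Treat it as a sketch, not a drop-in: the plugin path, the command file path and the 30-second timeout are assumptions based on a default source install, and the function names are made up. It assumes a check_nagios command is already defined in nrpe.cfg on the master.

```python
import os
import subprocess
import sys
import time

# All paths and the host name below are assumptions; adjust for your install.
CHECK_NRPE = "/usr/local/nagios/libexec/check_nrpe"
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"
MASTER_HOST = "master.server.net"

TAKE_OVER = ["ENABLE_NOTIFICATIONS", "START_EXECUTING_SVC_CHECKS"]
STAND_BY = ["DISABLE_NOTIFICATIONS", "STOP_EXECUTING_SVC_CHECKS"]


def commands_for(return_code):
    """Pick the external commands to send for a check_nrpe exit code.

    0 means Nagios on the master is OK, so the slave stands by.
    Any other code (warning, critical, unknown) makes the slave take over.
    """
    return STAND_BY if return_code == 0 else TAKE_OVER


def check_master():
    """Run check_nrpe against the master and return its exit code."""
    try:
        result = subprocess.run(
            [CHECK_NRPE, "-H", MASTER_HOST, "-c", "check_nagios"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
        return result.returncode
    except (OSError, subprocess.TimeoutExpired):
        # Could not run or finish the check: report UNKNOWN (3),
        # which this sketch treats the same as a failed master.
        return 3


def main():
    if not os.path.exists(COMMAND_FILE):
        # No command pipe usually means Nagios is not running on the slave.
        print("command file not found: %s" % COMMAND_FILE, file=sys.stderr)
        return
    now = int(time.time())
    lines = ["[%d] %s\n" % (now, cmd) for cmd in commands_for(check_master())]
    with open(COMMAND_FILE, "w") as pipe:
        pipe.writelines(lines)


if __name__ == "__main__":
    main()
```

Saved as, say, /usr/local/bin/nagios_failover.py (a made-up name), a crontab line like "* * * * * /usr/local/bin/nagios_failover.py" for the nagios user would run it every minute. Note that this version takes over as soon as a single check fails; if you want the slave to wait for several failures in a row, you would need to keep a counter somewhere between runs.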
If you don’t mind having them both run checks on all hosts, you don’t have to bother with the NSCA part. Otherwise, read up on distributed monitoring and how to use NSCA. The basic rundown: in your nagios.cfg you define an “OCSP” command (the ocsp_command setting, with obsess_over_services enabled), which runs after every single service check. This OCSP command points to a script which runs send_nsca, which sends your service check results to another Nagios server. If you have questions about setting that up after reading the docs, let me know.
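If you do go the NSCA route, the OCSP script mostly just reformats the check result and pipes it to send_nsca. A rough Python sketch, assuming your OCSP command definition passes $HOSTNAME$, $SERVICEDESC$, $SERVICESTATE$ and $SERVICEOUTPUT$ as the four arguments. The send_nsca paths and the central server name are placeholders, and the function names are made up for illustration:

```python
import subprocess
import sys

# Placeholders; adjust to your install.
SEND_NSCA = "/usr/local/nagios/bin/send_nsca"
SEND_NSCA_CFG = "/usr/local/nagios/etc/send_nsca.cfg"
CENTRAL_SERVER = "central.example.net"

# send_nsca wants a numeric return code, not the state name.
STATE_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_result(host, service, state, output):
    """Build the tab-separated line send_nsca expects on stdin."""
    code = STATE_CODES.get(state.upper(), 3)  # unrecognised state -> UNKNOWN
    return "%s\t%s\t%d\t%s\n" % (host, service, code, output)


def submit(host, service, state, output):
    """Pipe one service check result to send_nsca."""
    line = format_result(host, service, state, output)
    subprocess.run(
        [SEND_NSCA, "-H", CENTRAL_SERVER, "-c", SEND_NSCA_CFG],
        input=line.encode(),
        check=False,
    )


# Expected call: script.py <host> <service description> <state> <output>
if __name__ == "__main__" and len(sys.argv) == 5:
    submit(*sys.argv[1:5])
```

The central server needs the nsca daemon running and a matching passive service defined for each result it receives.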
Thanks for that comprehensive walkthrough, MP. I am currently waiting for a new node to come online so I can set up Nagios on it as well.
I did check out Distributed Monitoring but that looks to be a solution if you’re monitoring thousands of hosts/services. I’m not quite up to that stage yet so I think the failover monitoring you mention above should be exactly what I need.