Hi guys, I am setting up failover monitoring and need some advice on whether I am doing things right. We currently monitor our entire datacenter/WAN using a single Nagios box. I just installed a secondary, soon-to-be failover host in our failover datacenter. I have been reading the failover documentation and wrote a quick script, below:
[code]# Purpose: Check if master Nagios process is up for failover.
STATUS=$(/usr/local/nagios/libexec/check_nrpe -H nagios -c check_nagios)
RESULT=$?
if [ "$RESULT" -ne 0 ] # if nagios is not running
then
    echo "Nagios Master is down enabling active checks and notifications on $HOSTNAME" | /bin/mail -s "Nagios Master down" [email protected]
    /usr/local/nagios/libexec/eventhandlers/enable_active_service_checks
    /usr/local/nagios/libexec/eventhandlers/enable_notifications
else
    echo "Nagios Master is up disabling active checks and notifications on $HOSTNAME"
    /usr/local/nagios/libexec/eventhandlers/disable_active_service_checks
    /usr/local/nagios/libexec/eventhandlers/disable_notifications
fi
[/code]
The script would run every minute on the secondary (slave) host and check whether the master Nagios is up. If it is not, it will trigger “enable_active_service_checks”:
[code]#!/bin/sh
# Write a command to the Nagios command file to cause
# it to enable active service checks. This can be
# referred to as 'active' mode in a redundant monitoring
# environment.

# Notes:
# 1) This script is not intended to be used as an
#    event handler by itself. Instead, it is used by other
#    event handler scripts (like the redundancy examples).
# 2) In order for Nagios to process any commands that
#    are written to the command file, you must enable
#    the check_external_commands option in the main
#    configuration file.

printfcmd="/usr/bin/printf"
CommandFile="/var/nagios/rw/nagios.cmd"

# get the current date/time in seconds since UNIX epoch
datetime=`date +%s`

# pipe the command to the command file
$printfcmd "[%i] START_EXECUTING_SVC_CHECKS\n" $datetime >> $CommandFile
[/code]
and the “enable_notifications” script. My concern is whether there is a better solution than this, because as long as the master Nagios is up and running fine I will constantly be running “disable_active_service_checks”:
[code]#!/bin/sh
# Write a command to the Nagios command file to cause
# it to disable active service checks. This can be
# referred to as 'standby' mode in a redundant monitoring
# environment.

# Notes:
# 1) This script is not intended to be used as an
#    event handler by itself. Instead, it is used by other
#    event handler scripts (like the redundancy examples).
# 2) In order for Nagios to process any commands that
#    are written to the command file, you must enable
#    the check_external_commands option in the main
#    configuration file.

printfcmd="/usr/bin/printf"
CommandFile="/var/nagios/rw/nagios.cmd"

# get the current date/time in seconds since UNIX epoch
datetime=`date +%s`

# pipe the command to the command file
$printfcmd "[%i] STOP_EXECUTING_SVC_CHECKS\n" $datetime >> $CommandFile
[/code]
and the disable_notifications script:
[code]#!/bin/sh
# Write a command to the Nagios command file to cause
# it to disable host and service notifications

# Notes:
# 1) This script is not intended to be used as an
#    event handler by itself. Instead, it is used by other
#    event handler scripts (like the redundancy examples).
# 2) In order for Nagios to process any commands that
#    are written to the command file, you must enable
#    the check_external_commands option in the main
#    configuration file.

printfcmd="/usr/bin/printf"
CommandFile="/var/nagios/rw/nagios.cmd"

# get the current date/time in seconds since UNIX epoch
datetime=`date +%s`

# pipe the command to the command file
$printfcmd "[%i] DISABLE_NOTIFICATIONS;%i\n" $datetime $datetime >> $CommandFile
[/code]
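One idea for avoiding the constant re-running of the disable scripts is to remember the last mode in a small state file and only call the handlers when the mode actually changes. The following is only a rough sketch under assumptions: STATE_FILE is a hypothetical path, switch_mode is a made-up helper name, and HANDLER_DIR is taken from the paths used in the check script above.

```shell
#!/bin/sh
# Sketch: only call the handler scripts when the desired mode changes.
# STATE_FILE is a hypothetical location; HANDLER_DIR matches the paths
# used in the check script.
STATE_FILE="${STATE_FILE:-/var/tmp/nagios_failover.mode}"
HANDLER_DIR="${HANDLER_DIR:-/usr/local/nagios/libexec/eventhandlers}"

# switch_mode active|standby
# Runs the enable_*/disable_* handlers only if the requested mode differs
# from the one recorded by the previous run.
switch_mode() {
    new_mode=$1
    old_mode=$(cat "$STATE_FILE" 2>/dev/null)
    [ "$new_mode" = "$old_mode" ] && return 0   # nothing changed, do nothing

    case $new_mode in
        active)  prefix=enable ;;
        standby) prefix=disable ;;
        *) echo "unknown mode: $new_mode" >&2; return 2 ;;
    esac

    # Record the new mode only if both handlers succeeded.
    "$HANDLER_DIR/${prefix}_active_service_checks" &&
    "$HANDLER_DIR/${prefix}_notifications" &&
    echo "$new_mode" > "$STATE_FILE"
}

# Possible use from the every-minute check:
#   if /usr/local/nagios/libexec/check_nrpe -H nagios -c check_nagios >/dev/null
#   then switch_mode standby
#   else switch_mode active
#   fi
```

One caveat with this approach: the state file only reflects what the script last asked for. If the secondary Nagios is restarted and comes back with different settings, the file could be stale, so it may be worth deleting it whenever Nagios restarts.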
Is this how everyone is implementing failover? Is there a way to check from the CLI whether notifications are enabled, and make decisions based on that? Also, to keep the status and host info in sync, would it make sense to simply scp the status.dat and retention.dat files over to the remote host?
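On the CLI question, one possibility might be to read the flag straight out of status.dat, which as far as I can tell has a programstatus block containing lines such as enable_notifications=1 and active_service_checks_enabled=1. A minimal sketch, assuming that layout; the STATUS_FILE default is a guess (the real path is the status_file setting in nagios.cfg) and both function names are made up for illustration:

```shell
#!/bin/sh
# Sketch: read a global flag from the programstatus block of status.dat.
# STATUS_FILE is an assumed path; check status_file in nagios.cfg.
STATUS_FILE="${STATUS_FILE:-/usr/local/nagios/var/status.dat}"

# program_status_value KEY
# Prints the value of KEY (e.g. enable_notifications) from programstatus.
program_status_value() {
    awk -F= -v key="$1" '
        /^[ \t]*programstatus[ \t]*[{]/ { in_block = 1; next }
        in_block && /^[ \t]*[}]/        { exit }
        in_block {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim indentation around the key
            if (k == key) { print $2; exit }
        }
    ' "$STATUS_FILE"
}

# Succeeds (exit status 0) only when notifications are globally enabled.
notifications_enabled() {
    [ "$(program_status_value enable_notifications)" = "1" ]
}

# Possible use:
#   if notifications_enabled; then echo "notifications are on"; fi
```

The file is only rewritten every status update interval, so the value can lag a few seconds behind a command just written to nagios.cmd.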
Thanks for any input if possible
Don