Nagios Failover

Hi guys, I am setting up failover monitoring and need some advice on whether I am doing things right. We currently monitor our entire datacenter/WAN using a single Nagios box. I just installed a secondary, soon-to-be failover host in our failover datacenter. I have been reading the failover documentation and wrote the quick script below:

[code]# Purpose: Check if master Nagios process is up for failover.

# Ask the master (via NRPE) whether its Nagios process is running
STATUS=`/usr/local/nagios/libexec/check_nrpe -H nagios -c check_nagios`
RESULT=$?

if [ $RESULT -ne 0 ] # if nagios is not running
then
    echo "Nagios Master is down enabling active checks and notifications on $HOSTNAME" | /bin/mail -s "Nagios Master down" [email protected]
    /usr/local/nagios/libexec/eventhandlers/enable_active_service_checks
    /usr/local/nagios/libexec/eventhandlers/enable_notifications
else
    echo "Nagios Master is up disabling active checks and notifications on $HOSTNAME"
    /usr/local/nagios/libexec/eventhandlers/disable_active_service_checks
    /usr/local/nagios/libexec/eventhandlers/disable_notifications
fi
[/code]

The script would run every minute on the secondary slave and check if the master Nagios is up. If not, it will trigger “enable_active_service_checks”:

[code]#!/bin/sh

# Write a command to the Nagios command file to cause
# it to enable active service checks. This can be
# referred to as 'active' mode in a redundant monitoring
# environment.

# Notes:
# 1) This script is not intended to be used as an
#    event handler by itself. Instead, it is used by other
#    event handler scripts (like the redundancy examples).
# 2) In order for Nagios to process any commands that
#    are written to the command file, you must enable
#    the check_external_commands option in the main
#    configuration file.

printfcmd="/usr/bin/printf"

CommandFile="/var/nagios/rw/nagios.cmd"

# get the current date/time in seconds since UNIX epoch
datetime=`date +%s`

# pipe the command to the command file
$printfcmd "[%i] START_EXECUTING_SVC_CHECKS\n" $datetime >> $CommandFile
[/code]

and the “enable_notifications” script. My concern is whether there is a better solution than this, since as long as the master Nagios is up and running fine I will constantly be running the “disable_active_service_checks” script:

[code]#!/bin/sh

# Write a command to the Nagios command file to cause
# it to disable active service checks. This can be
# referred to as 'standby' mode in a redundant monitoring
# environment.

# Notes:
# 1) This script is not intended to be used as an
#    event handler by itself. Instead, it is used by other
#    event handler scripts (like the redundancy examples).
# 2) In order for Nagios to process any commands that
#    are written to the command file, you must enable
#    the check_external_commands option in the main
#    configuration file.

printfcmd="/usr/bin/printf"

CommandFile="/var/nagios/rw/nagios.cmd"

# get the current date/time in seconds since UNIX epoch
datetime=`date +%s`

# pipe the command to the command file
$printfcmd "[%i] STOP_EXECUTING_SVC_CHECKS\n" $datetime >> $CommandFile
[/code]

and the disable_notifications script:

[code]#!/bin/sh

# Write a command to the Nagios command file to cause
# it to disable host and service notifications

# Notes:
# 1) This script is not intended to be used as an
#    event handler by itself. Instead, it is used by other
#    event handler scripts (like the redundancy examples).
# 2) In order for Nagios to process any commands that
#    are written to the command file, you must enable
#    the check_external_commands option in the main
#    configuration file.

printfcmd="/usr/bin/printf"

CommandFile="/var/nagios/rw/nagios.cmd"

# get the current date/time in seconds since UNIX epoch
datetime=`date +%s`

# pipe the command to the command file
$printfcmd "[%i] DISABLE_NOTIFICATIONS;%i\n" $datetime $datetime >> $CommandFile
[/code]

Is this how everyone is implementing failover? Is there a way to check whether notifications are enabled via the CLI, and make decisions based on that? Also, to keep the status and host info in sync, would it make sense to simply scp the status.dat and retention.dat files over to the remote host?
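
For the CLI check, the closest thing I have come up with so far is reading the global flags out of the programstatus block in status.dat. Again just a sketch under my own assumptions: the status.dat path below is a guess for my install (it is whatever status_file points to in nagios.cfg), program_status is a name I made up, and the key names are what I see in my own status.dat.

```shell
#!/bin/sh
# Sketch: read a global flag from the programstatus block of status.dat.
# STATUS_FILE is an assumed path; program_status is a hypothetical helper.

STATUS_FILE="${STATUS_FILE:-/usr/local/nagios/var/status.dat}"

# Usage: program_status KEY
# Prints the value of KEY from the programstatus block, or nothing if
# the key is not found there.
program_status() {
    awk -F= -v key="$1" '
        /^[ \t]*programstatus[ \t]*\{/ { in_block = 1; next }  # block start
        in_block && /^[ \t]*\}/        { exit }                # block end
        in_block {
            k = $1
            gsub(/^[ \t]+/, "", k)     # strip leading indentation
            if (k == key) { print $2; exit }
        }
    ' "$STATUS_FILE"
}
```

With that, something like [ "$(program_status enable_notifications)" = "1" ] could decide whether the disable scripts need to run at all, and program_status active_service_checks_enabled would do the same for active checks. Keep in mind status.dat is only rewritten periodically, so the value can lag behind a command that was just submitted.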

Thanks for any input if possible

Don