I've created an event handler to start a specific service on a Windows box (using 'net rpc', which is awesome). The script works fine when I run it manually, and the service's state (CRITICAL/HARD/5) looks like it should kick off the event handler, but the handler never seems to run. I'm hoping someone here can help me stop beating my head against the wall.
From commands.cfg:
define command{
command_name start-ndsserver
# "6" is chosen here to be greater than the 5-before-notification level set for the service, to avoid pre-notification start attempts, and we don't want restarts on any soft failure
command_line $USER1$/event_handlers/run_win_process.sh $HOSTNAME$ "\"NDS Server0\"" start $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ 6
}
From the Nagios config:
define service{
use generic-service
contact_groups Windows-Admins
host_name servername01
service_description eDir Login
servicegroups eDirectory-XYZTREE
check_command check_nwauth!
_webname eDir%20Login
process_perf_data 0
event_handler start-ndsserver
flap_detection_enabled 0
}
The script, which definitely does work on the command line and is definitely located as specified above:
#!/bin/bash
# Script to start or stop services on a Win2k3 server via the Remote Procedure Call service
#
# Altered form of the example "HTTPD" script from the Nagios docs
#
# $1 = serverName
# $2 = serviceName
# $3 = serviceAction
# $4 = $SERVICESTATE$
# $5 = state type (SOFT or HARD; $SERVICESTATETYPE$ in Nagios)
# $6 = $SERVICEATTEMPT$
# $7 = the "soft" (pre-notification) check attempt on which to send the start command. If this is higher than the number
# of checks configured before notification, the soft-state start never happens (so this auto-recovery won't hide an ongoing issue)
# What server?
serverName=$1
# What service?
serviceName=$2
# What action?
serviceAction=$3
function serviceAction {
serverName=$1
serviceName=$2
serviceAction=$3
expect -c "spawn -noecho /usr/bin/net rpc service $serviceAction $serviceName -U $userName -S $serverName" \
-c 'expect Password:' \
-c "send $userPass\n" \
-c 'expect eof'
}
# Get authentication from secret file
userName=""
userPass=""
while read input
do
if [ -z "$userName" ]
then
export userName=`echo $input | awk -F "=" '/userName/ { print $2 }' | sed 's/"//g'`
fi
if [ -z "$userPass" ]
then
export userPass=`echo $input | awk -F "=" '/userPass/ { print $2 }' | sed 's/"//g'`
fi
done < /home/nagios/.secret
# What state is the service in?
case "$4" in
OK)
# The service just came back up, so don't do anything...
;;
WARNING)
# We don't really care about warning states, since the service is probably still running...
;;
UNKNOWN)
# We don't know what might be causing an unknown error, so don't do anything...
;;
CRITICAL)
# Aha! The service appears to have a problem - perhaps we should start the service...
# Is this a "soft" or a "hard" state?
case "$5" in
# We're in a "soft" state, meaning that Nagios is in the middle of retrying the
# check before it turns into a "hard" state and contacts get notified...
SOFT)
# What check attempt are we on? We don't want to restart the service on the first check, because it may just be a fluke!
# Wait a few times to see if this is a transient (ie. network) error
case "$6" in
$7)
# Action for service
echo $serverName "$serviceName" $serviceAction
serviceAction $serverName "$serviceName" $serviceAction
;;
esac
;;
# The service should have been started by the code above, but for some reason it wasn't (for example, the soft-state start above was bypassed)
HARD)
# Action for service
echo $serverName "$serviceName" $serviceAction
serviceAction $serverName "$serviceName" $serviceAction
;;
esac
;;
esac
# Clear userPass
userPass=""
exit 0
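To make the intended decision logic easier to check in isolation, here is a minimal sketch of what the nested case statements above are meant to do. The function name should_act is made up for illustration and is not part of the script. It assumes the handler is given the service state, the state type (SOFT or HARD), the current attempt, and the soft-attempt threshold, in that order.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): returns 0 when the
# handler would run the service action, non-zero otherwise.
# $1 = service state, $2 = state type (SOFT or HARD),
# $3 = current check attempt, $4 = soft-attempt threshold
should_act() {
    local state=$1 state_type=$2 attempt=$3 threshold=$4
    # Only CRITICAL states are acted on; OK, WARNING and UNKNOWN are ignored.
    [ "$state" = "CRITICAL" ] || return 1
    case "$state_type" in
        HARD) return 0 ;;                        # always act on a hard failure
        SOFT) [ "$attempt" = "$threshold" ] ;;   # act only on the chosen soft attempt
        *)    return 1 ;;                        # anything else: do nothing
    esac
}

# The values from the log entry (CRITICAL, HARD, attempt 5) with threshold 6:
if should_act CRITICAL HARD 5 6; then
    echo "handler would act"
else
    echo "handler would do nothing"
fi
```

With a threshold of 6 and notification after 5 attempts, the SOFT branch should never match, so only the HARD branch can trigger the start. If the fifth argument arrives as anything other than the literal string HARD or SOFT, nothing happens at all.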
And a snippet from the Nagios log showing the state of the service:
[1269301766] SERVICE ALERT: servername01;eDir Login;CRITICAL;HARD;5;FAIL
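For reference, this is how I read that log line. The sketch below splits it on semicolons, and the variable names are only illustrative. The state, state type and attempt fields are the values the handler should be receiving as its fourth, fifth and sixth arguments.

```shell
#!/bin/bash
# The alert line from the log, copied verbatim.
line='[1269301766] SERVICE ALERT: servername01;eDir Login;CRITICAL;HARD;5;FAIL'

# Drop everything up to and including "SERVICE ALERT: ", then split on ';'.
fields=${line#*"SERVICE ALERT: "}
IFS=';' read -r host service state state_type attempt output <<EOF
$fields
EOF

echo "host=$host service=$service"
echo "state=$state type=$state_type attempt=$attempt output=$output"
```

The parsed values can then be passed by hand as arguments 4 to 6 of run_win_process.sh, to compare a manual run against what Nagios should be sending.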
Any help appreciated. Thanks!