Event handler doesn't kick off



I’ve created an event handler to start a specific service on a Windows box (using ‘net rpc’, which is awesome). All the scripting works fine when run manually, and the service's state (CRITICAL/HARD/5) looks like it should trigger the event handler, but the handler never seems to run. I’m hoping someone here can help me stop beating my head against the wall.

From commands.cfg:

define command{
        command_name    start-ndsserver
        # "6" is chosen here to be greater than the 5-before-notification level set for the service, to avoid pre-notification start attempts, and we don't want restarts on any soft failure
        command_line    $USER1$/event_handlers/run_win_process.sh $HOSTNAME$ "\"NDS Server0\"" start $SERVICESTATE$ $STATETYPE$ $SERVICEATTEMPT$ 6
        }
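
To double-check how the arguments in that command_line reach the script, here is a minimal sketch. It assumes Nagios hands the command line to a shell, so the escaped quotes around the service name survive as literal characters. show_args is a hypothetical stand-in for run_win_process.sh, and the macro values are filled in by hand from the log line further down.

```shell
#!/bin/bash
# Hypothetical stand-in for run_win_process.sh (not part of the real setup):
# prints each positional argument on its own line, numbered like $1..$7.
show_args() {
    local i=1 arg
    for arg in "$@"; do
        printf '$%d = %s\n' "$i" "$arg"
        i=$((i + 1))
    done
}

# Macro values substituted by hand (assumed from the log line: CRITICAL;HARD;5)
show_args servername01 "\"NDS Server0\"" start CRITICAL HARD 5 6
```

The second argument arrives as a single word with the double quotes still attached, which is what the script later hands to net rpc.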

From the Nagios config:

define service{
 use generic-service
 contact_groups Windows-Admins
 host_name servername01
 service_description eDir Login
 servicegroups eDirectory-XYZTREE
 check_command check_nwauth!
 _webname eDir%20Login
 process_perf_data 0
 event_handler start-ndsserver
 flap_detection_enabled 0
}

The script, which definitely does work on the command line and is definitely located as specified above:

#!/bin/bash
# Script to start or stop services on a Win2k3 server via the Remote Procedure Call service
#
# Altered form of the example "HTTPD" script from the Nagios docs
#
# $1 = serverName
# $2 = serviceName
# $3 = serviceAction
# $4 = $SERVICESTATE$
# $5 = $STATETYPE$
# $6 = $SERVICEATTEMPT$
# $7 = the number of "soft" (pre-notification) checks before sending a start command. If this is higher than the number of checks configured before sending notification, this won't
#      ever be done (meaning, that this auto-recovery won't hide an ongoing issue)

# What server?
serverName=$1

# What service?
serviceName=$2

# What action?
serviceAction=$3

function serviceAction {
        serverName=$1
        serviceName=$2
        serviceAction=$3
        expect -c "spawn -noecho /usr/bin/net rpc service $serviceAction $serviceName -U $userName -S $serverName" \
                -c 'expect Password:' \
                -c "send $userPass\n" \
                -c 'expect eof'
}

# Get authentication from secret file
userName=""
userPass=""
while read input
do
        if [ -z "$userName" ]
        then
                export userName=`echo $input | awk -F "=" '/userName/ { print $2 }' | sed 's/"//g'`
        fi

        if [ -z "$userPass" ]
        then
                export userPass=`echo $input | awk -F "=" '/userPass/ { print $2 }' | sed 's/"//g'`
        fi
done < /home/nagios/.secret

# What state is the service in?
case "$4" in
OK)
        # The service just came back up, so don't do anything...
        ;;
WARNING)
        # We don't really care about warning states, since the service is probably still running...
        ;;
UNKNOWN)
        # We don't know what might be causing an unknown error, so don't do anything...
        ;;
CRITICAL)
        # Aha!  The service appears to have a problem - perhaps we should start the service...

        # Is this a "soft" or a "hard" state?
        case "$5" in

        # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
        # check before it turns into a "hard" state and contacts get notified...
        SOFT)

                # What check attempt are we on?  We don't want to restart the service on the first check, because it may just be a fluke!
                # Wait a few times to see if this is a transient (ie. network) error
                case "$6" in
                $7)
                        # Action for service
                        echo $serverName "$serviceName" $serviceAction
                        serviceAction $serverName "$serviceName" $serviceAction
                        ;;
                esac
                ;;

        # It should have been restarted by the code above, but for some reason it wasn't (for example, the soft-state service start above was bypassed)
        HARD)
                # Action for service
                echo $serverName "$serviceName" $serviceAction
                serviceAction $serverName "$serviceName" $serviceAction
                ;;
        esac
        ;;
esac

# Clear userPass
userPass=""

exit 0
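
For comparison, the credential-reading loop could be reduced to one small function. This is only a sketch, assuming /home/nagios/.secret holds lines of the form userName="..." and userPass="..." (which is what the awk and sed calls above imply). read_secret and the temporary demo file are hypothetical names used for illustration.

```shell
#!/bin/bash
# read_secret FILE KEY: print the value of the first KEY=value line in FILE,
# with any double quotes stripped. Like the original awk call, it only keeps
# the text between the first and second "=".
read_secret() {
    local file=$1 key=$2
    awk -F '=' -v key="$key" '$1 == key { gsub(/"/, "", $2); print $2; exit }' "$file"
}

# Demo against a throwaway file rather than the real secret file
demo_file=$(mktemp)
printf 'userName="svc_nagios"\nuserPass="example"\n' > "$demo_file"
userName=$(read_secret "$demo_file" userName)
userPass=$(read_secret "$demo_file" userPass)
echo "user=$userName"
rm -f "$demo_file"
```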

And a snippet from the Nagios log showing the state of the service:

[1269301766] SERVICE ALERT: servername01;eDir Login;CRITICAL;HARD;5;FAIL
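
As a dry run of what the script should do with that state, here is a rough sketch of its case logic pulled out into a function that only prints a decision. decide is a hypothetical helper, not part of the real script, and it assumes the arguments arrive exactly as listed in the script's header comments.

```shell
#!/bin/bash
# decide STATE TYPE ATTEMPT THRESHOLD: print "act" when the script's case
# statements would reach the serviceAction call, otherwise "skip".
decide() {
    local state=$1 state_type=$2 attempt=$3 threshold=$4
    if [ "$state" != "CRITICAL" ]; then
        echo skip
        return
    fi
    case "$state_type" in
    SOFT)
        # Only act on the soft check whose attempt number equals the threshold
        if [ "$attempt" = "$threshold" ]; then echo act; else echo skip; fi
        ;;
    HARD)
        echo act
        ;;
    *)
        # Any other value falls through without doing anything
        echo skip
        ;;
    esac
}

decide CRITICAL HARD 5 6   # the state from the log line
decide CRITICAL SOFT 5 6
decide CRITICAL SOFT 6 6
```

With the values from the log line the function prints act, so if the arguments really do arrive as CRITICAL, HARD and 5, the start command should be attempted. Any fifth argument other than SOFT or HARD falls through silently.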

Any help appreciated. Thanks!