Hi,
I used Netsaint a long time ago and have recently started using Nagios to monitor a small number of servers. Everything is installed and working as expected, and I have a script set up to notify me via SMS when something’s wrong.
I have an ADSL router which, even though it’s on a UPS, sometimes locks up (the lock-ups appeared to be related to power fluctuations but they still happen :()
I also have an APC MasterSwitch so I’m intending to put the ADSL router on the MasterSwitch and power cycle it if Nagios determines it’s down.
The event handler does fire when the host goes down, but when the check_command for the host object fails there is no delay between retries. That means the router could be power cycled even if it only failed to respond for a couple of seconds.
My host is defined as:
define host {
    use             generic-host
    host_name       router
    address         a.b.c.d
    parents         gatekeeper
    check_command   check-host-alive
    contact_groups  admins
    event_handler   powercycle-device!e.f.g.h!private!6!i!7
}
and the event handler is:
# Power cycle a device connected to an APC MasterSwitch
define command{
    command_name    powercycle-device
    command_line    /usr/local/nagios/libexec/eventhandlers/powercycle-device $HOSTSTATE$ $HOSTSTATETYPE$ $HOSTATTEMPT$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$
}
The following is an extract from the log file; it shows the event handler being fired in rapid succession:
[1165967490] HOST EVENT HANDLER: router;DOWN;SOFT;1;powercycle-device!e.f.g.h!private!6!i!7
[1165967490] HOST ALERT: router;DOWN;SOFT;2;CRITICAL - Host Unreachable (a.b.c.d)
[1165967490] HOST EVENT HANDLER: router;DOWN;SOFT;2;powercycle-device!e.f.g.h!private!6!i!7
[1165967490] HOST ALERT: router;DOWN;SOFT;3;CRITICAL - Host Unreachable (a.b.c.d)
[1165967490] HOST EVENT HANDLER: router;DOWN;SOFT;3;powercycle-device!e.f.g.h!private!6!i!7
[1165967490] HOST ALERT: router;DOWN;SOFT;4;CRITICAL - Host Unreachable (a.b.c.d)
[1165967490] HOST EVENT HANDLER: router;DOWN;SOFT;4;powercycle-device!e.f.g.h!private!6!i!7
[1165967491] HOST ALERT: router;DOWN;HARD;5;CRITICAL - Host Unreachable (a.b.c.d)
[1165967491] HOST NOTIFICATION: admin;router;DOWN;host-notify-by-sms;CRITICAL - Host Unreachable (a.b.c.d)
[1165967491] HOST EVENT HANDLER: router;DOWN;HARD;5;powercycle-device!e.f.g.h!private!6!i!7
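To put a number on how quickly the retries happen, here is a rough Python sketch that reads the timestamps out of log lines like the ones above and reports how many seconds passed between the first entry and the first DOWN/HARD entry. The function names and the regular expression are my own assumptions based only on the format visible in this extract, not anything provided by Nagios, and only three of the lines are included for brevity.

```python
import re

# A few lines copied from the extract above (shortened for brevity).
LOG_LINES = [
    "[1165967490] HOST EVENT HANDLER: router;DOWN;SOFT;1;powercycle-device!e.f.g.h!private!6!i!7",
    "[1165967490] HOST ALERT: router;DOWN;SOFT;2;CRITICAL - Host Unreachable (a.b.c.d)",
    "[1165967491] HOST ALERT: router;DOWN;HARD;5;CRITICAL - Host Unreachable (a.b.c.d)",
]

# Assumed layout: [timestamp] HOST ALERT|EVENT HANDLER: host;state;type;attempt;...
ENTRY = re.compile(
    r"^\[(\d+)\] HOST (?:ALERT|EVENT HANDLER): ([^;]+);([^;]+);([^;]+);(\d+);"
)


def parse(line):
    """Return (timestamp, host, state, state_type, attempt) or None."""
    match = ENTRY.match(line)
    if match is None:
        return None
    ts, host, state, state_type, attempt = match.groups()
    return int(ts), host, state, state_type, int(attempt)


def seconds_to_hard(lines):
    """Seconds from the first parsed entry to the first DOWN/HARD entry."""
    entries = [entry for entry in map(parse, lines) if entry is not None]
    if not entries:
        return None
    first_ts = entries[0][0]
    for ts, _host, state, state_type, _attempt in entries:
        if state == "DOWN" and state_type == "HARD":
            return ts - first_ts
    return None


print(seconds_to_hard(LOG_LINES))
```

For these lines the difference between the first timestamp (1165967490) and the HARD entry (1165967491) is one second, which is the problem: all five attempts are used up almost immediately.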
The powercycle-device script will only tell the MasterSwitch to reboot the port when the status is DOWN/HARD.
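For reference, the decision the script makes is roughly the following. This is only a Python sketch of the logic, not the real script: the function names are made up for illustration, and the actual call to the MasterSwitch is left out and replaced by a returned message.

```python
def should_power_cycle(state, state_type):
    """Act only when Nagios reports the host as DOWN in a HARD state."""
    return state == "DOWN" and state_type == "HARD"


def handle_event(argv):
    """Decide what to do for one event handler invocation.

    argv mirrors the command_line above:
    script, $HOSTSTATE$, $HOSTSTATETYPE$, $HOSTATTEMPT$, $ARG1$ ... $ARG5$
    """
    if len(argv) < 4:
        return "usage: powercycle-device STATE STATETYPE ATTEMPT [ARGS...]"
    state, state_type, attempt = argv[1], argv[2], argv[3]
    switch_args = argv[4:]  # passed through untouched; meaning is site-specific
    if should_power_cycle(state, state_type):
        # Placeholder: the real script would talk to the MasterSwitch here.
        return "power cycle requested (attempt %s, args %s)" % (attempt, switch_args)
    return "ignored %s/%s (attempt %s)" % (state, state_type, attempt)


# The SOFT events from the log are ignored; only the HARD one triggers.
print(handle_event(["powercycle-device", "DOWN", "SOFT", "1", "e.f.g.h", "private", "6", "i", "7"]))
print(handle_event(["powercycle-device", "DOWN", "HARD", "5", "e.f.g.h", "private", "6", "i", "7"]))
```

So the SOFT attempts are harmless in themselves; the trouble is that the HARD state is reached within about a second of the first failure.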
Is there any way I can ensure that it waits a sensible amount of time before cycling the router’s power?
I’ve considered doing this on a service, using check_ping and not having a host check_command but then I’ll get incorrect notifications about devices that are the other side of the router!
Any advice would be appreciated. Many thanks,
Zeb