Host down event firing too soon


#1

Hi,

I used Netsaint a long time ago and have recently started using Nagios to monitor a small number of servers. Everything is installed and working as expected and I have a script set up to notify me via SMS when something’s wrong.

I have an ADSL router which, even though it’s on a UPS, sometimes locks up (the lock-ups appeared to be related to power fluctuations but they still happen :()

I also have an APC MasterSwitch so I’m intending to put the ADSL router on the MasterSwitch and power cycle it if Nagios determines it’s down.

I’ve got this event firing when the host is down, but when the check_command against the host object fails there’s no delay between retries, which may mean that the router would be power cycled even if it only failed to respond for a couple of seconds.

My host is defined as:

define host {
use generic-host

host_name router
address a.b.c.d
parents gatekeeper
check_command check-host-alive
contact_groups admins

event_handler powercycle-device!e.f.g.h!private!6!i!7
}

and the event handler is:

# Power cycle a device connected to an APC MasterSwitch

define command{
command_name powercycle-device
command_line /usr/local/nagios/libexec/eventhandlers/powercycle-device $HOSTSTATE$ $HOSTSTATETYPE$ $HOSTATTEMPT$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$
}

The following extract from the log file shows the event being fired rapidly:

[1165967490] HOST EVENT HANDLER: router;DOWN;SOFT;1;powercycle-device!e.f.g.h!private!6!i!7
[1165967490] HOST ALERT: router;DOWN;SOFT;2;CRITICAL - Host Unreachable (a.b.c.d)
[1165967490] HOST EVENT HANDLER: router;DOWN;SOFT;2;powercycle-device!e.f.g.h!private!6!i!7
[1165967490] HOST ALERT: router;DOWN;SOFT;3;CRITICAL - Host Unreachable (a.b.c.d)
[1165967490] HOST EVENT HANDLER: router;DOWN;SOFT;3;powercycle-device!e.f.g.h!private!6!i!7
[1165967490] HOST ALERT: router;DOWN;SOFT;4;CRITICAL - Host Unreachable (a.b.c.d)
[1165967490] HOST EVENT HANDLER: router;DOWN;SOFT;4;powercycle-device!e.f.g.h!private!6!i!7
[1165967491] HOST ALERT: router;DOWN;HARD;5;CRITICAL - Host Unreachable (a.b.c.d)
[1165967491] HOST NOTIFICATION: admin;router;DOWN;host-notify-by-sms;CRITICAL - Host Unreachable (a.b.c.d)
[1165967491] HOST EVENT HANDLER: router;DOWN;HARD;5;powercycle-device!e.f.g.h!private!6!i!7

The powercycle-device script will only tell the MasterSwitch to reboot the port when the status is DOWN/HARD.

Is there any way I can ensure that it waits a sensible amount of time before cycling the router’s power?

I’ve considered doing this on a service, using check_ping and not having a host check_command but then I’ll get incorrect notifications about devices that are the other side of the router!

Any advice would be appreciated. Many thanks,

Zeb


#2

nagios.sourceforge.net/docs/2_0/ … dlers.html
Pretty good explanation in the nagios docs. So, what is happening is supposed to happen. The reason is, nagios wants you to have the ability to take action after every state change, even soft.
So, read the docs, look at their example, and then modify your powercycle script to fit your needs. Whether it reboots after soft or hard, your script makes that decision, not nagios.


#3

Hi,

thanks for your reply. My script is handling all states as expected so that a DOWN/SOFT won’t trigger the power cycle but DOWN/HARD will. I did mention that, just under the log extract.

The problem is that the 5 check_commands get executed far too quickly. If you look at the timestamps you’ll see that all 5 checks were made within only 1 second!

I’ve tried setting check_interval (even though not recommended) but that doesn’t even resolve it.

I can’t use the service checks as they’re suspended whilst the host is determined to be down!

Any other advice?

Many thanks,

Zeb.


#4

I’ve never gotten around to working out what ‘obsess’ does in Nagios. It’s not something to do with that is it?


#5

It looks like that only controls whether the ochp_command runs after each host check, which doesn’t seem to be what I need.

I’m sure that Netsaint didn’t perform the host checks so frequently, but then again that was a long time ago and my memory’s not that good!


#6

obsess is a method of running some command after each check is run. It’s widely used to pass the output of a service check on a remote host to an nsca client, which in turn passes the data on to the nagios central server. That is a distributed setup and obsess is the main ingredient.

check_interval is only going to force nagios to run a host check every “check_interval”, which is usually redundant, since you most likely are already running a service check on a host.

Setting max_check_attempts to some high value will make nagios take longer to run the host check, but again, that’s not a good thing either. The reason is that nagios suspends all other activity until it finds out what the heck is wrong with this host. It has to do that, because the rest of the network just might depend on that host.

Host checks are supposed to be very, very quick. The logic is that you have already run a service check on the host, so if it fails that, then most likely the host is down too.

If you want nagios to power cycle the host if it fails to respond to a check-host-alive, then why not power cycle it if it fails a service check of “check_icmp”? With service checks, you can set the max_check_attempts as high as you want, and it won’t impact the rest of nagios checks. retry_check_interval can be set higher also. If the check_ping or check_icmp fails, then surely it’s going to do the exact same thing with check-host-alive, right? So what’s the diff? Power cycle it on the service failure, not the host check.
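
As a rough sketch of what the decision inside a service-based handler could look like: this assumes the handler is called with $SERVICESTATE$ and $SERVICESTATETYPE$ as its first two arguments, and the function name should_power_cycle is made up for illustration. The actual MasterSwitch command is left out.

```shell
#!/bin/sh
# should_power_cycle STATE STATETYPE  (hypothetical helper)
# Succeeds only when the service is CRITICAL and in a HARD state,
# so soft retries never trigger a power cycle.
should_power_cycle() {
    case "$1/$2" in
        CRITICAL/HARD) return 0 ;;
        *)             return 1 ;;
    esac
}

# Assumed calling convention: $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$
if should_power_cycle "${1:-}" "${2:-}"; then
    echo "would power cycle the device here"
fi
```

With the retry settings on the service, the handler would only reach the CRITICAL/HARD branch after the last retry has failed.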


#7

Good heavens. I’ve been wondering about setting up a distributed system now that mine is growing somewhat. That’s really useful, many thanks. I’ve just been playing with nsca to do some passive checks for backup script success and suchlike so I’m some of the way there already :slight_smile:


#8

My nagios central server is almost 100% passive, using nsca. All checks are made on remote 'nix type boxes with nagios installed on them. They utilize that obsess and pass the data on to the central server. Why set up nagios on remote hosts? Well, I have 3 remote 'nix boxes that run the majority of the checks. But there are some checks that just can’t be done remotely. So I install nagios on those boxes also. It’s better managed that way, instead of running a cronjob on the remote host. It doesn’t take up much space or system resources either. Never had a problem with it. On winblows boxes, I use nc_net, which has a built-in nsca client. So, the central server just sits here, doing nothing, except accepting output and displaying.


#9

OK, I see what you mean. To test it first of all I was changing the IP address of router in hosts.cfg and then restarting Nagios. That caused the host check to run immediately. I’ve now tried it by adding an invalid route for the router and this is the result (approximately 150 seconds after I added the route):

[1166286011] HOST ALERT: router;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)
[1166286011] HOST EVENT HANDLER: router;DOWN;SOFT;1;powercycle-device!e.f.g.h!private!6!i!7
[1166286014] HOST ALERT: router;DOWN;SOFT;2;CRITICAL - Host Unreachable (a.b.c.d)
[1166286014] HOST EVENT HANDLER: router;DOWN;SOFT;2;powercycle-device!e.f.g.h!private!6!i!7
[1166286017] HOST ALERT: router;DOWN;SOFT;3;CRITICAL - Host Unreachable (a.b.c.d)
[1166286017] HOST EVENT HANDLER: router;DOWN;SOFT;3;powercycle-device!e.f.g.h!private!6!i!7
[1166286020] HOST ALERT: router;DOWN;SOFT;4;CRITICAL - Host Unreachable (a.b.c.d)
[1166286020] HOST EVENT HANDLER: router;DOWN;SOFT;4;powercycle-device!e.f.g.h!private!6!i!7
[1166286023] HOST ALERT: router;DOWN;HARD;5;CRITICAL - Host Unreachable (a.b.c.d)
[1166286023] HOST NOTIFICATION: admin;router;DOWN;host-notify-by-sms;CRITICAL - Host Unreachable (a.b.c.d)
[1166286023] HOST EVENT HANDLER: router;DOWN;HARD;5;powercycle-device!e.f.g.h!private!6!i!7
[1166286023] SERVICE ALERT: router;PING;CRITICAL;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)

It still seems that the host check is being performed before anything else and then the service check is happening after the host has been determined to be down! The router would have been power cycled after 12 seconds of not responding.
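
To put a number on it, here is a small sketch that pulls the epoch timestamps out of log lines like the ones above and prints the gap between the first and the last. The function name elapsed_seconds is hypothetical; it only assumes each line starts with a bracketed timestamp.

```shell
#!/bin/sh
# elapsed_seconds (hypothetical helper): reads Nagios log lines on stdin
# and prints the difference in seconds between the last and the first
# bracketed epoch timestamp.
elapsed_seconds() {
    sed -n 's/^\[\([0-9][0-9]*\)\].*/\1/p' |
        awk 'NR == 1 { first = $1 }
             { last = $1 }
             END { if (NR > 0) print last - first }'
}

# First SOFT alert and the HARD alert from the extract above
printf '%s\n' \
    '[1166286011] HOST ALERT: router;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)' \
    '[1166286023] HOST ALERT: router;DOWN;HARD;5;CRITICAL - Host Unreachable (a.b.c.d)' |
    elapsed_seconds    # prints 12
```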

I’d prefer it to power cycle the device after check_ping fails 3 times, as configured in services.cfg. My service checks are set to be performed every 5 minutes.


#10

Most likely because you have set check_interval for the host, which is not a good idea. Also, you shouldn’t have an event handler enabled for the host.
Lastly, never use the nagios restart function. Use reload if you make config changes.


#11

Plus, you may have to use the cgi pages to enable/disable event handlers now due to retention, even if you made a config file change.

nagios.sourceforge.net/docs/2_0/ … tion_notes


#12

I did try setting it but it’s not set now.

Why not? If it’s so bad, why is there provision for it?

I tried using the reload but then found that all my services were then showing as being down. I then switched to restart and everything looked OK.

I’ve now found that there were permission problems on resource.cfg so it couldn’t be reloaded as the nagios user. I’ve fixed it and it appears to be OK.

Just recently I’ve been adding a duff route so I haven’t been restarting nagios.

Even if I put the event handler on the PING service against router then the above log extract shows that it will be power cycled 12 seconds after first failing to respond!

I have the following configured against my services, so I’d like the router power cycled 2 minutes after it’s first spotted down:

max_check_attempts 3
normal_check_interval 5
retry_check_interval 1

I can’t see how I can do this without disabling the host check.
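
A rough sketch of the arithmetic behind that 2 minute figure, assuming the default interval_length of 60 seconds and ignoring the time the checks themselves take (and ignoring any host check that gets triggered). The function name seconds_to_hard_state is made up for illustration.

```shell
#!/bin/sh
# seconds_to_hard_state MAX_CHECK_ATTEMPTS RETRY_CHECK_INTERVAL [INTERVAL_LENGTH]
# The first failed check is attempt 1, so only (attempts - 1) retry
# intervals pass before the final attempt puts the service in a HARD state.
# INTERVAL_LENGTH defaults to 60 seconds (an assumption about nagios.cfg).
seconds_to_hard_state() {
    max_check_attempts=$1
    retry_check_interval=$2
    interval_length=${3:-60}
    echo $(( (max_check_attempts - 1) * retry_check_interval * interval_length ))
}

seconds_to_hard_state 3 1    # prints 120
```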

Many thanks.


#13

"Also, you shouldn’t have an event handler enabled for the host."
I said that because I thought we were going to make the event handler fire off when you have a service check in the hard state, and not the host. So why have it enabled, if we aren’t going to use it? It’s not a bad thing, but it’s just kinda not needed, right?

Anyway, I still think it’s the script. As stated before, an event handler will be processed each and every time there is a change. So it’s up to the script to make the decision as to when we are going to reboot the box. Soft state, no reboot; hard state, yes reboot. From your service settings above, it will take 3 intervals for the check to fail. I don’t know what your interval is, but that might equate to 3 minutes. Depends…


#14

"I tried using the reload but then found that all my services were then showing as being down. I then switched to restart and everything looked OK."
Seriously, never use restart. If you must restart, then use the stop and start, but not restart.
So I think you should make sure what you have now by doing this:
/etc/rc.d/init.d/nagios stop
ps -ef|grep nagios
kill any that remain
verify with another
ps -ef|grep nagios
now start nagios
/etc/rc.d/init.d/nagios start


#15

OK, it’s not the event handler script. Here it is in its entirety:

[code]#!/bin/sh
# Log every invocation: host state, state type, attempt, then the handler arguments.
echo "$1 $2 $3 $4 $5 $6 $7 $8" >> /tmp/powercycle.log

# What state is the host in?
case "$1" in
DOWN)
    # The device appears to have a problem - tell the MasterSwitch to power
    # cycle it.

    # Is this a "soft" or a "hard" state?
    case "$2" in
    HARD)
        #
        # We're in a "hard" state, meaning that Nagios has determined that this
        # device has failed to respond a number of times.
        #
        printf 'Power cycling outlet %s' "$6" >> /tmp/powercycle.log

        #
        # $4 = MasterSwitch address, $5 = SNMP community, $6 = outlet number,
        # $7 and $8 = SNMP value type and value to set.
        #
        /usr/bin/snmpset -c "$5" -v 1 "$4" ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.$6" "$7" "$8" >> /tmp/powercycle.log
        ;;
    esac
    ;;
esac
exit 0
[/code]

I understand that the event handler will run after every event and that it’s up to the handler to decide what to do. As you can see, it will only ever send the power cycle command to the MasterSwitch when it sees a DOWN/HARD

It produces a crude log which is:

DOWN SOFT 1 e.f.g.h private 6 i 7
DOWN SOFT 2 e.f.g.h private 6 i 7
DOWN SOFT 3 e.f.g.h private 6 i 7
DOWN SOFT 4 e.f.g.h private 6 i 7
DOWN HARD 5 e.f.g.h private 6 i 7
Power cycling outlet 6SNMPv2-SMI::enterprises.318.1.1.4.4.2.1.3.6 = INTEGER: 7

Which shows that the power cycle only happens on a DOWN/HARD.

I don’t mind whether the power cycle happens on a host event or a service event but the log above shows that the host went DOWN/HARD and the service went CRITICAL/HARD only 12 seconds after the problem was first spotted.

I’m going to try disabling the host check and just using the service check to see if that helps matters. Unfortunately that will mean Nagios will still attempt to check anything that’s the other side of the router.


#16

OK, I see where the problem is. You want to reboot the box after 2 minutes of “not responding”. Problem is, when a service check returns a “non-ok” state, the host check is run. The host check only takes a very short time, and of course, it is going to fail also. Due to the host check failing, the service check is “assumed” failed. So, we need to make your host check last for 2 minutes somehow. But if we do that, then all other nagios activity will be suspended.
nagios.sourceforge.net/docs/2_0/ … uling.html

Personally, I think you are asking for too much. If the box will not respond to a ping host check, I don’t see why you would not want to reboot the box. If the box is failing to reply to pings for up to 2 minutes at a time, then you really have a problem that should be fixed.
But if you demand to have the box rebooted after 2 minutes, then I suppose you could use a host check that will ALWAYS return an “UP” status. That way, the service check will continue on with its max retries and retry interval, which you would make add up to 2 minutes, and then your script would fire off the reboot.
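
A minimal sketch of such a check, assuming it is saved as an executable script and wired in as the host’s check_command (the function name and the message text are made up). Nagios treats a plugin exit code of 0 as OK, which for a host check means UP.

```shell
#!/bin/sh
# check_always_up (hypothetical name): a dummy host check that never fails.
# It prints one line of plugin output and returns 0, so the host is
# always reported as UP regardless of its real state.
check_always_up() {
    echo "OK - host check disabled, assuming UP"
    return 0
}

# The script's exit status is that of the last command, i.e. 0.
check_always_up
```

The trade-off, already raised earlier in the thread, is that Nagios will then never mark the router DOWN, so anything on the other side of it keeps being checked.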


#17

nagios.sourceforge.net/docs/2_0/ … uling.html
From the docs:
“One instance where Nagios checks the status of a host is when a service check results in a non-OK status. Nagios checks the host to decide whether or not the host is up, down, or unreachable. If the first host check returns a non-OK state, Nagios will keep pounding out checks of the host until either (a) the maximum number of host checks (specified by the max_attempts option in the host definition) is reached or (b) a host check results in an OK state.”


#18

I read that too. This seems a bit daft. If a service fails, a host check is performed. If that too fails then it’ll keep going (as fast as possible) until max_attempts has been reached when the host is deemed to be down and the service is marked down as well.

In this case, the 5 host checks take only 12 seconds to complete.

So after only 12 seconds of not responding the host and service are marked as down! And the router would be restarted.

I’ve tried this with no host check_command specified and the event running on the service check failing and it works as expected.

If only there was a retry_check_interval for host checks it would work as expected.


#19

Well, it’s not really daft, if you think about it. A machine that fails a service check such as httpd, and also is found to NOT RESPOND to a PING, is surely down. Since many other hosts might depend on this host to be up, it’s imperative for nagios to stop all else and determine its status. Why bother to continue checking for an httpd service when we can’t even ping the device with check-host-alive? Bottom line is, if you have devices that don’t reply to a ping or service check from time to time, then you might need to make a change on your service check, so that it doesn’t fail the first time. Or perhaps fix the real problem with the host.


#20

I understand the way it works and think it is sensible.

What I think is daft is the fact that after only 12 seconds of not responding the host is deemed down.

This router shouldn’t ever fail to respond in normal operation but I would prefer that Nagios waited more than 12 seconds before taking any action!

In fact, had I had it working it would have come in handy this morning when the router locked up!