First off, I’ve got around 200 servers being monitored, about 30 routers, and about 15 checkpoint sites, and Nagios has been working great. My question is this: is there some way to set the hosts up to be checked every 5 minutes or so? The reason I ask is that I tend to get a lot of messages (with the standard ‘real time’ check) about a few of the routers and checkpoint sites when they are not really down, but have just timed out for a few seconds. I’ve changed interval_length in my nagios.cfg to 180 seconds, and my hosts.cfg is pretty much the default. Since I only monitor a handful of services, I don’t mind if I have to change a setting that will affect stuff ‘across the board’. Does anyone know of an easy way to do this? Changing the interval length seems to have helped some, but is there something else I could do that might be better? Thanks in advance for the help.
Hi,
I think you’re after the max_check_attempts option within the service or host definition. Setting it to 2 will stop you from getting a notification until the 2nd check attempt fails.
I’ve got my max_check_attempts set to 5 right now. When I changed the interval length along with this, I got more of a delay, but I was wondering if there was another way. Maybe I need to bump the interval length up to 300?
define host{
        use                     generic-host            ; Name of host template to use
        host_name               RALSAPPS03
        parents                 VOIN Cisco Switch 4507R
        alias                   RALSAPPS03
        address                 x.x.x.x
        check_command           check-host-alive
        max_check_attempts      5
        notification_interval   720
        notification_period     24x7
        notification_options    d,r
        contact_groups          admins,gsc,tech
        }
If you were doing this on a defined service, you could change your normal_check_interval to something other than 1; it is a multiplier of the interval length. Not sure that this works on a host definition though.
Are you saying that your router fails to respond 5 times in a row, so you are getting email notifications even now? If so, it sounds like you have big network problems with packet loss.
Looking at the definition of check-host-alive, you could change it to send more than one packet by changing the -p flag. It only reports a failure on 100% packet loss, so by changing -p to 5 you only need one reply out of five for the host to count as up.
Or you might want to try and fix your network?
You might be able to just put a ‘check_interval’ into the host definition?
nagios.sourceforge.net/docs/2_0/ … k_interval
They warn, however, not to do this for optimization reasons.
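For reference, here is a rough sketch of what that might look like, reusing the host posted above. The only addition is check_interval, which is counted in units of interval_length (so 5 means 5 minutes when interval_length is 60). The value 5 is just an assumption for illustration, and the 2.0 docs advise against scheduling regular host checks unless you really need them.

```
define host{
        use                     generic-host
        host_name               RALSAPPS03
        parents                 VOIN Cisco Switch 4507R
        alias                   RALSAPPS03
        address                 x.x.x.x
        check_command           check-host-alive
        check_interval          5       ; assumed value: 5 x interval_length between scheduled host checks
        max_check_attempts      5
        notification_interval   720
        notification_period     24x7
        notification_options    d,r
        contact_groups          admins,gsc,tech
        }
```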
[ Edited Thu Jan 05 2006, 04:30PM ]
hey now
I had tried check_interval in my host definition even though it’s not recommended, and Nagios just skipped right over it as if it wasn’t even there.
[ Edited Thu Jan 05 2006, 04:38PM ]
I’m fairly sure you are wrong: you will get the warning after max_check_attempts x interval_length. So if the interval is 60 seconds and you have max_check_attempts set to 5, you will get an email after 5 minutes (although the display in Nagios will go red after the initial failure).
Thinking again about what you’re doing, I think I would create a copy of check-host-alive (check-dodgy-host-alive) with a -p setting of 5 and use that as the check command for the dodgy routers/checkpoints.
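Something along these lines, as a sketch only. The thresholds are the ones I remember from the stock check-host-alive in the 2.x sample config, so compare against your own checkcommands.cfg before copying. The host name, alias and address are made up for the example.

```
# Copy of the stock host check, sending 5 packets instead of 1.
# Critical is only raised at 100% packet loss, so one reply out of 5 is enough.
define command{
        command_name    check-dodgy-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }

# Hypothetical slow router that uses the new command.
define host{
        use                     generic-host
        host_name               slow-router-01          ; made-up name
        alias                   Slow router 01
        address                 x.x.x.x
        check_command           check-dodgy-host-alive
        max_check_attempts      5
        notification_interval   720
        notification_period     24x7
        notification_options    d,r
        contact_groups          admins
        }
```

Hosts that behave themselves can stay on the normal check-host-alive, so only the flaky sites pay for the extra packets.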
What about changing the ‘host_inter_check_delay_method’ in nagios.cfg to an x.xx value instead of smart mode and then setting the ‘max_host_check_spread’?
The check spread defines the maximum amount of time within which you want all of the hosts to be checked, and the inter-check delay determines how Nagios spreads out the host checks when it starts monitoring…
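Roughly like this in nagios.cfg. Both numbers are assumptions picked for illustration, not recommendations; as far as I know these settings only matter for hosts that have regularly scheduled checks.

```
# nagios.cfg (Nagios 2.x)

# Fixed delay in seconds between consecutive host checks at startup.
# Other accepted values are n (none), d (dumb, 1 second) and s (smart, the default).
host_inter_check_delay_method=0.25

# Maximum number of minutes after startup by which every host
# should have had its first scheduled check.
max_host_check_spread=30
```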
*shrug* Just a thought, might be worth looking at.
markj, what you’re saying about interval length is making sense now. That’s probably why it looked like it was working better when in actuality it wasn’t. I’ll create another check-host-alive for some of my slower sites, call it something like check-host-alive-slow, and edit my hosts.cfg to use it. That might do the trick.
tsumi, the inter-check delay would probably work initially, but once it hits that first slow site it might still send me messages. I’m not sure. I’ll have to play around with it some.
I think you are going about this the wrong way completely.
First, you set interval_length in your nagios.cfg to 180 instead of the default 60, which makes for a headache when you talk with other Nagios users, since they assume an interval is 1 minute.
Regularly scheduled host checks should NEVER be needed, since you will most likely have a check_ping service check. Your problem is entirely the plugin itself that you have for your “real time check”. I don’t know what service check you are getting false positives out of, but address that issue.
One thing I noticed is that check_ping, or any service check for that matter, can show a device as down when, as far as the check is concerned, it really is down, since it’s not responding to a “ping”, for example.
So, in that case, the best thing to do is to make darn sure it won’t “ping” for a very long time before sending out an alert.
So, for the check_ping command, you may want to play with the -p option, which sets the number of packets sent. If I set -p 1, then surely I’ll have many, many check failures. But if I set -p 10, it will send 10 packets, and if it still fails after that many, the device MUST be down.
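As a sketch, a command definition along these lines would do it. The name check_ping_10 is made up, and the -t 20 is my own assumption: sending 10 packets takes around 10 seconds, which is right at the plugin’s default timeout, so a bit of headroom seems sensible.

```
# check_ping variant that sends 10 packets per check.
# $ARG1$ and $ARG2$ are the warning and critical thresholds (rta,loss%)
# passed in from the service definition.
define command{
        command_name    check_ping_10
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 10 -t 20
        }
```

A service would then call it as something like check_ping_10!100.0,20%!500.0,60%, with thresholds adjusted to suit your links.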
First, set interval_length=60 in nagios.cfg, like it was by default, and restore any other item you changed, like the service_interleave_factor (leave it at s).
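In other words, the relevant lines in nagios.cfg should be back to something like this:

```
# nagios.cfg
# One time unit = 60 seconds, so interval values in object definitions read as minutes.
interval_length=60

# "s" (smart) lets Nagios work out the interleave factor itself.
service_interleave_factor=s
```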
We want to make darn sure it’s down, so in services.cfg we set the following for the service check that is a problem:
max_check_attempts 5
normal_check_interval 5
retry_check_interval 1
That way, if the check command fails, Nagios will retry it every minute until it has made 5 attempts in total. If it still fails, then the service is definitely down. Or do you have devices that actually won’t respond to a service check for longer than a 5 minute period? If so, then set max_check_attempts 10. That would be a total of 10 checks, all 1 minute apart, for a span of roughly 10 minutes. Still fails after 10 attempts? Then whatever plugin you are using is total crap, or you do, in fact, have a serious communication problem with your equipment.
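Put together, a full service definition might look something like the sketch below. The three interval settings are the ones above; everything else (template name, host, description, thresholds, contact group) is an assumption filled in so the example is complete, so swap in your own values.

```
define service{
        use                     generic-service         ; assumed template name from the sample config
        host_name               RALSAPPS03
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%  ; assumed thresholds
        max_check_attempts      5       ; notify only after 5 failed attempts in a row
        normal_check_interval   5       ; while OK, check every 5 x interval_length = 5 minutes
        retry_check_interval    1       ; after a failure, re-check every 1 minute
        check_period            24x7
        notification_interval   720
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          admins
        }
```

With interval_length=60, a device that stops answering gets re-checked once a minute, and the alert only goes out if every one of the 5 attempts fails.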
Look at it this way. I have the same problem with 2 switch stacks in one of our hub closets. Both stacks, at various times, simply stop responding to a ping. They will do this for a span of 2 minutes or so at times, but since I have max_check_attempts 5, alerts are never sent, since the check command works after the 2nd or 3rd retry (or 2 or 3 minutes later). But since Nagios has shown me these failures, I am able to identify that there is a problem with this equipment. It’s not the Nagios setup, and at this time I don’t know if it’s the router or the switches, or the fiber, or what. Same in your situation. Nagios is telling you that you have a problem. You really need to figure out why your service check is timing out and showing the device as down. This tells me that your customers are also denied access. In my situation with the switch stacks, there is no interruption for devices plugged into the switch. My only problem is getting the switch stack management IP address to respond to a ping. So in my case, customers are NOT affected. In your case, they might be, and you need to fix that, not fix Nagios.
[ Edited Thu Jan 05 2006, 08:10PM ]