Retry Checks and premature alerts


#1

I have been running Nagios successfully for a couple of years now without any major issues or config changes, but recently I’ve begun to see a steady increase in premature alerts, unreachable hosts, and plugin timeouts. The recheck interval seems to be far too short (3 seconds), even though my config is set to 10 check attempts and a minimum of 3 minutes. Some of the questionable alert history is below. Does anyone have any idea what could be causing this, or how I can increase the recheck interval for unreachable hosts or timeouts to something more reasonable?

[01-11-2006 05:46:19] HOST ALERT: xxx;DOWN;HARD;10;CRITICAL - Host Unreachable (x.x.x.x)
[01-11-2006 05:46:16] HOST ALERT: xxxx;DOWN;SOFT;9;CRITICAL - Host Unreachable (x.x.x.x)
[01-11-2006 05:46:13] HOST ALERT: xxxx;DOWN;SOFT;8;CRITICAL - Host Unreachable (x.x.x.x)
[01-11-2006 05:46:10] HOST ALERT: xxxx;DOWN;SOFT;7;CRITICAL - Host Unreachable (x.x.x.x)
[01-11-2006 05:46:07] HOST ALERT: xxxx;DOWN;SOFT;6;CRITICAL - Host Unreachable (x.x.x.x)
[01-11-2006 05:46:04] HOST ALERT: xxxx;DOWN;SOFT;5;CRITICAL - Host Unreachable (x.x.x.x)…
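
To put a number on the spacing between retries, the excerpt can be parsed mechanically. This is only a rough Python sketch: the function names are made up for illustration, the masked address at the end of each line is trimmed, and the timestamp is assumed to be month-day-year (every entry is on the same day, so the gaps come out the same either way).

```python
import re
from datetime import datetime

# Alert lines from the excerpt above, with the trailing masked address trimmed.
LOG = """\
[01-11-2006 05:46:19] HOST ALERT: xxx;DOWN;HARD;10;CRITICAL - Host Unreachable
[01-11-2006 05:46:16] HOST ALERT: xxxx;DOWN;SOFT;9;CRITICAL - Host Unreachable
[01-11-2006 05:46:13] HOST ALERT: xxxx;DOWN;SOFT;8;CRITICAL - Host Unreachable
[01-11-2006 05:46:10] HOST ALERT: xxxx;DOWN;SOFT;7;CRITICAL - Host Unreachable
[01-11-2006 05:46:07] HOST ALERT: xxxx;DOWN;SOFT;6;CRITICAL - Host Unreachable
[01-11-2006 05:46:04] HOST ALERT: xxxx;DOWN;SOFT;5;CRITICAL - Host Unreachable
"""

# Captures the timestamp, the state type, and the attempt number.
PATTERN = re.compile(
    r"\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] HOST ALERT: "
    r"[^;]*;[^;]*;(SOFT|HARD);(\d+);"
)


def attempt_times(log_text):
    """Map each attempt number to the time its alert was logged."""
    times = {}
    for match in PATTERN.finditer(log_text):
        # Assumption: MM-DD-YYYY. The date order does not affect the gaps here.
        stamp = datetime.strptime(match.group(1), "%m-%d-%Y %H:%M:%S")
        times[int(match.group(3))] = stamp
    return times


def gaps_in_seconds(times):
    """Seconds between consecutive attempts, in attempt order."""
    attempts = sorted(times)
    return [
        int((times[later] - times[earlier]).total_seconds())
        for earlier, later in zip(attempts, attempts[1:])
    ]


print(gaps_in_seconds(attempt_times(LOG)))  # → [3, 3, 3, 3, 3]
```

Attempts 5 through 10 are each exactly 3 seconds apart, so the host goes from attempt 5 to a HARD state in 15 seconds.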


#2

Maybe I asked this question poorly. Basically, when a host goes Critical or Unreachable, as in the case above, Nagios rechecks it every 3 seconds until it reaches max_check_attempts (which is 10). This is too frequent. Where is the configuration option that will let me increase this interval? I realize I could raise the check attempts above 10 to extend the time before an alert, but that seems like the wrong way to go about it.


#3

I might be wrong, but it looks like you’ve set your global interval_length to 3 seconds. Is that the case?

If so, I would imagine it is a little excessive; even my interval of 15 seconds is a bit obsessive.

Anyway, you can set the retry_check_interval for a service. It is a multiplier of interval_length.
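
To make the multiplication concrete, here is a small Python sketch of the arithmetic. The function names are made up for illustration, and the "time until HARD" figure is a simplification that assumes each check returns instantly and retries are spaced exactly one retry interval apart. The retry_check_interval of 1 in the second example is an assumed value, not something taken from the poster's config.

```python
def retry_delay_seconds(retry_check_interval, interval_length=60):
    """Seconds between retries: the per-object multiplier times the global unit."""
    return retry_check_interval * interval_length


def seconds_until_hard_state(max_check_attempts, retry_check_interval,
                             interval_length=60):
    """Rough time from the first failed check to the last attempt.

    Simplification: ignores how long each check takes to run.
    """
    retries = max_check_attempts - 1
    return retries * retry_delay_seconds(retry_check_interval, interval_length)


# The settings described in the first post: 3-minute retries, 10 attempts.
print(retry_delay_seconds(3, 60))           # → 180
print(seconds_until_hard_state(10, 3, 60))  # → 1620

# The guess above: interval_length=3 with an assumed retry_check_interval of 1.
print(retry_delay_seconds(1, 3))            # → 3
print(seconds_until_hard_state(10, 1, 3))   # → 27
```

With the settings described in the first post, a HARD alert should take roughly 27 minutes, not the seconds seen in the log.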


#4

I believe that the default number of retry checks is 3, which would mean you’ve got your interval length set to 1 second. I tried that once, but it’s too much of a hassle. Go back to the default of 60 seconds. You’ll find this setting in your nagios.cfg file. You’ll probably want to grep the file to find the line, though, because it’s a ways down.


#5

Those are the things I considered too, but that doesn’t seem to be it. My interval_length is still the default of 60 seconds, and the retry_check_interval of a failed service still seems to work correctly if the host is actually online and pingable. The problem seems to be limited to hosts using the check_host_alive command when Nagios cannot actually communicate with them. When a host goes offline, or in my case there is a minor network blip, Nagios has already sent out an alert because it only has a 3-second window between retries. I don’t have active host checks enabled, so there is no host check interval defined. It must be getting its retry interval for that unreachable host from somewhere else.


#6

INTERVAL LENGTH
This is the seconds per unit interval as used in the
host/contact/service configuration files. Setting this to 60 means
that each interval is one minute long (60 seconds). Other settings
have not been tested much, so your mileage is likely to vary…

interval_length=60

The above is the default. You state yours is the default. Check again; it can’t be.
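
One way to "check again" without relying on eyeballing the file is to list every uncommented occurrence of the setting, so a stray second definition further down would show up. This is a rough Python sketch, assuming the main config uses plain name=value lines with # comments; find_setting and SAMPLE are made-up names for illustration.

```python
def find_setting(config_text, name="interval_length"):
    """Return (line_number, value) for every uncommented `name=value` line."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith("#"):
            continue
        key, separator, value = stripped.partition("=")
        if separator and key.strip() == name:
            hits.append((number, value.strip()))
    return hits


# Stand-in for the relevant part of a nagios.cfg file.
SAMPLE = """\
# INTERVAL LENGTH
# This is the seconds per unit interval
interval_length=60
"""

print(find_setting(SAMPLE))  # → [(3, '60')]
```

To run it against a real install, pass in the contents of your own nagios.cfg, for example `find_setting(open("/usr/local/nagios/etc/nagios.cfg").read())`. That path is only an assumption and varies by install. More than one hit, or a value other than 60, would be worth a closer look.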


#7

Yes, check again, and then go back to base if you’re positive that it’s set to 60 seconds. Make sure there aren’t any errors when you run a nagios -v command.