Huge Check Latency

Hi,

I’ve had Nagios 1.3 running fine for a few months. Yesterday we had a power cut and the Nagios box went off, and since then it has had huge check latencies.

Time Frame             Checks Completed
<= 1 minute:           0 (0.0%)
<= 5 minutes:          0 (0.0%)
<= 15 minutes:         40 (10.9%)
<= 1 hour:             366 (100.0%)
Since program start:   40 (10.9%)

Metric                  Min.      Max.      Average
Check Execution Time:   < 1 sec   10 sec    0.098 sec
Check Latency:          2 sec     242 sec   151.691 sec
Percent State Change:   0.00%     0.00%     0.00

Does anyone have any ideas?

I thought it might be a corrupt log file, so I cleared out all the old logs. This improved the latency from about 60 minutes to around 15, but it was managing 5-6 minutes before.

I’m still not sure my config is right. It was working before, although whenever more than one host went down at the same time the latency went up a lot.

Which of the config files would I need to post so someone could look to see if there is something wrong?

I don’t think I’ve ever really got it set up right in the first place, but I’ve been through all the documentation and I can’t see what I’ve done wrong.

Thanks for your help.

When a service check fails, the first thing Nagios has to do is stop everything else and find out whether the host is up or down. Only after that can it carry on with its other scheduled duties; during that host check, NOTHING else happens. So if you have a slow host check, that may be your problem. I suggest making your host checks very simple and fast, and making them fail after one check only rather than the default of 5 retries.
Example: instead of using
    define command{
        command_name check-host-alive
        command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
    }
for a host check, make it “-p 1”, since sending 1 packet is a lot quicker than sending 5.

BTW, you may want to look at each service you have, and find out which one is taking so long to get output for.
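
One way to find the slow check is to run each plugin command by hand and time it. Below is a minimal Python sketch of that idea; the helper name time_command is made up for illustration, and the plugin path and arguments you pass in are whatever your own command definitions use.

```python
import subprocess
import sys
import time


def time_command(argv, timeout=60):
    """Run a command once and return (exit_code, elapsed_seconds, first_output_line).

    exit_code is None if the command did not finish within `timeout` seconds.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        code = proc.returncode
        lines = proc.stdout.splitlines()
        first = lines[0] if lines else ""
    except subprocess.TimeoutExpired:
        code, first = None, "timed out after %d s" % timeout
    elapsed = time.monotonic() - start
    return code, elapsed, first
```

For example, calling time_command with the check_ping command line from your checkcommands.cfg (with $USER1$ and $HOSTADDRESS$ filled in by hand) against a host that is up and one that is down should show how long a single attempt really takes in each case.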

The default nagios.cfg file has:
service_check_timeout=60
That is 60 seconds, but it appears you may have changed that too. What else have you changed from the default values?
max_concurrent_checks, possibly?
auto_reschedule_checks?
service_interleave_factor?
It could be any number of things that you changed from the tried and true default settings.

Anyway, yes, you have horrible latency, and I would suggest restoring all of the default values in your .cfg files, unless you can remember what you have changed from the defaults.
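
If you can't remember what was changed, comparing your nagios.cfg against a copy of the sample file is one way to find out. Here is a rough Python sketch, assuming both files use plain key=value lines with # comments at the start of a line; the function names are made up for illustration.

```python
def parse_cfg(text):
    """Parse 'key=value' lines from a nagios.cfg-style file into a dict.

    Blank lines and lines starting with '#' are skipped. Directives that may
    repeat (such as cfg_file) keep only their last value in this simple sketch.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def changed_settings(default_text, current_text):
    """Return {key: (default_value, current_value)} for every key that differs.

    A value of None means the key is missing from that file.
    """
    default = parse_cfg(default_text)
    current = parse_cfg(current_text)
    diff = {}
    for key in sorted(set(default) | set(current)):
        if default.get(key) != current.get(key):
            diff[key] = (default.get(key), current.get(key))
    return diff
```

Read the two files into strings and pass them to changed_settings; anything it reports (max_concurrent_checks, service_interleave_factor and so on) is a setting worth a second look.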

Hi, thanks for your help.

I did some more investigation. The hosts.cfg was set as follows:

hosts.cfg

define host{
    use generic-host
    host_name BLAH.BLAHServer
    alias Blah Blah
    address 10.0.0.2
    check_command check-host-alive
    max_check_attempts 20
    notification_interval 120
    notification_period 24x7
    notification_options d,u,r
}

So max_check_attempts was 20. I thought that was a bit much, so I set it to 3 instead.

Seemed to help a bit.

Then, in checkcommands.cfg:

# ‘check-host-alive’ command definition

define command{
    command_name check-host-alive
    command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 3 -t 20
}

I changed the number of pings to 1, as you suggested:

# ‘check-host-alive’ command definition

define command{
    command_name check-host-alive
    command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1 -t 20
}

However, if I change the -t 20 value to anything other than 20, I get this when running a check against a downed host:

CRITICAL - Socket timeout after 10 seconds 

Also, as you suggested, I checked the nagios.cfg file, and that looks okay too.

nagios.cfg

service_check_timeout=60
host_check_timeout=60
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5

Some of these tweaks have made things better. I’m going to keep an eye on it and get back to you with the results.

Time Frame             Checks Completed
<= 1 minute:           0 (0.0%)
<= 5 minutes:          240 (65.6%)
<= 15 minutes:         366 (100.0%)
<= 1 hour:             366 (100.0%)
Since program start:   366 (100.0%)

Metric                  Min.      Max.      Average
Check Execution Time:   < 1 sec   10 sec    0.784 sec
Check Latency:          20 sec    180 sec   107.063 sec
Percent State Change:   0.00%     0.00%     0.00%

Thanks for your help thus far.

Hi,

Okay, I’m sort of getting somewhere.

It looks like when it finds a host that is down, it sticks on that host for ages and the latency shoots up. When it eventually finishes checking the downed host, the latency drops a lot, but when it comes round to that host again it goes back up.

What am I doing wrong? I have it doing obsessive host checks. Would this cause a problem?

Thanks

Set obsess_over_hosts to 1 only if this is a remote Nagios install that sends data to a central Nagios server via send_nsca, by way of the ochp_command= setting. What is that set to in your case?

Nagios is working the way it should. As I stated earlier, everything stops when a service fails and a host check is being made. So it’s VITAL that you make this host check be as quick and clean as possible.
Read this:
nagiosexchange.org/nagios-us … wUid%5D=28

So, let’s start here.
Get the latest check_icmp from here:
oss.op5.se/nagios/
oss.op5.se/nagios/check_icmp-2005-06-01.tar.gz
Untar it, then run make check_icmp.
cp check_icmp to your libexec dir.
Set the owner first and the mode second, since on Linux changing ownership can clear the setuid bit:
chown root.nagios check_icmp
chmod 4750 check_icmp

Try it out by hand from your libexec dir AFTER you have switched to the nagios user:
su - nagios
./check_icmp -h
./check_icmp -H localhost -w 20,20% -c 30,30%

Pay attention to what the -h help just said:
“If this program is invoked as check_host (with a symlink, preferrably), it will exit with status
OK upon the first properly received ICMP_ECHOREPLY, making it ideal
for hostchecks (less network load and much, much faster on OK)”

So, let’s do that.
cd /usr/local/nagios/libexec
ln -s check_icmp check_host
./check_host localhost

vi checkcommands.cfg
and add:
    define command{
        command_name check_host
        command_line $USER1$/check_host -H $HOSTADDRESS$
    }

vi hosts.cfg
Make sure that ALL of your host checks NOW use check_host instead of check-host-alive:
    check_command check_host
    #check_command check-host-alive

Then reload Nagios:
/etc/rc.d/init.d/nagios reload
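
If you have a lot of hosts, a quick script can confirm that none were missed. This is only a sketch with made-up function names, and it assumes one directive per line and does not follow "use" templates, so a host that inherits its check_command from a template shows up as None.

```python
import re


def hosts_by_check_command(text):
    """Map host_name -> check_command for each 'define host{...}' block.

    Simplified: lines starting with '#' are comments, ';' starts an inline
    comment, and template inheritance ('use ...') is not resolved.
    """
    result = {}
    current = None  # directives of the host block being read, None outside one
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue
        line = line.split(";", 1)[0].strip()
        if not line:
            continue
        if re.match(r"define\s+host\s*\{", line):
            current = {}
        elif line.startswith("}") and current is not None:
            name = current.get("host_name", "(unnamed)")
            result[name] = current.get("check_command")
            current = None
        elif current is not None:
            parts = line.split(None, 1)
            if len(parts) == 2:
                current[parts[0]] = parts[1].strip()
    return result


def hosts_not_using(text, wanted="check_host"):
    """Return the sorted host names whose check_command is not `wanted`."""
    found = hosts_by_check_command(text)
    return sorted(name for name, cmd in found.items() if cmd != wanted)
```

Pass it the contents of hosts.cfg; an empty list from hosts_not_using means every host definition that sets check_command explicitly has been switched over.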

Now, when a service fails, a host check will still be made, but this time the host check is going to be quick, unlike yours where you had -p 3, or whatever.

PS: in hosts.cfg, set
max_check_attempts 1
The service check has already failed, so it’s likely the host is down, and we don’t want to waste time on retries.
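
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes the worst case: every attempt against a down host runs all the way to the plugin timeout (the -t 20 from the check_ping command earlier in the thread) and the retries run back to back while nothing else is scheduled. The function name is made up for illustration.

```python
def worst_case_block_seconds(max_check_attempts, seconds_per_attempt):
    """Upper bound on how long one down host can hold up other checks,
    assuming each attempt hits its timeout and retries run back to back."""
    return max_check_attempts * seconds_per_attempt


# Values from this thread: check_ping ran with -t 20, and max_check_attempts
# was 20 originally, then 3, with 1 suggested above.
for attempts in (20, 3, 1):
    total = worst_case_block_seconds(attempts, 20)
    print("max_check_attempts=%2d -> up to %3d s per down host" % (attempts, total))
```

Under those assumptions a single down host could stall things for up to 400 seconds with 20 attempts, against 20 seconds with one attempt, and several down hosts add up.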

Hi,

I disabled the obsess_over_hosts option (I’m not using nrpe). I had enabled it a short while ago because Nagios sometimes missed when a host came back up.

Here are the stats now:

<= 1 minute:           20 (5.5%)
<= 5 minutes:          366 (100.0%)
<= 15 minutes:         366 (100.0%)
<= 1 hour:             366 (100.0%)
Since program start:   366 (100.0%)

Metric                  Min.      Max.      Average
Check Execution Time:   < 1 sec   5 sec     0.407 sec
Check Latency:          < 1 sec   1 sec     0.128 sec
Percent State Change:   0.00%     0.00%     0.00%

It looks to be working much better now. Thanks for your help, jakkedup, it’s much appreciated. I’ll give those check_icmp suggestions a whirl to see what they do.

Thanks again