HUGE Nagios Check Latency


#1

Hello,

I am running Nagios 2.10, which monitors approximately 3000 hosts (1 service per host).
I am seeing huge check latency and have tried a number of things, none of which appear to work.
Here is the output of the Nagios preflight check:

Nagios 2.10
Copyright © 1999-2007 Ethan Galstad (nagios.org)
Last Modified: 10-21-2007
License: GPL

Projected scheduling information for host and service
checks is listed below. This information assumes that
you are going to start running Nagios with your current
config files.

HOST SCHEDULING INFORMATION

Total hosts: 3345
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 30 min
First scheduled check: N/A
Last scheduled check: N/A

SERVICE SCHEDULING INFORMATION

Total services: 3345
Total scheduled services: 3345
Service inter-check delay method: SMART
Average service check interval: 300.00 sec
Inter-check delay: 0.09 sec
Interleave factor method: SMART
Average services per host: 1.00
Service interleave factor: 1
Max service check spread: 30 min
First scheduled check: Mon Aug 18 10:38:13 2008
Last scheduled check: Mon Aug 18 10:43:12 2008

CHECK PROCESSING INFORMATION

Service check reaper interval: 10 sec
Max concurrent service checks: 400

PERFORMANCE SUGGESTIONS

I have no suggestions - things look okay.
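A quick sanity check on the scheduling figures above. This is only a sketch: it assumes the SMART method spreads the checks evenly over the average check interval, and the function name is made up for illustration.

```python
def inter_check_delay(avg_check_interval_sec: float, total_services: int) -> float:
    """Delay between consecutive checks if they are spread evenly over the interval.

    Assumption: this mirrors what the SMART inter-check delay method does.
    """
    if total_services <= 0:
        raise ValueError("total_services must be positive")
    return avg_check_interval_sec / total_services


if __name__ == "__main__":
    # Figures taken from the preflight output: 300 sec interval, 3345 services.
    delay = inter_check_delay(300.0, 3345)
    print(f"{delay:.2f} sec between checks, {1 / delay:.2f} checks/sec")
    # prints "0.09 sec between checks, 11.15 checks/sec"
```

So the scheduler has to start roughly 11 checks every second to keep up, which matches the 0.09 sec inter-check delay reported above.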

Here is the tactical overview:
Service Check Execution Time: 0.24 / 15.46 / 2.463 sec
Service Check Latency: 634.00 / 683.67 / 658.990 sec
Host Check Execution Time: 0.03 / 7.22 / 0.676 sec
Host Check Latency: 0.00 / 0.00 / 0.000 sec

Active Host / Service Checks: 3345 / 3345

Passive Host / Service Checks: 0 / 0

As you can see above, I have set max concurrent service checks to 400 instead of the default “0” (unlimited). The service check latency is in the hundreds of seconds (and at times reaches the thousands).
I have kept the ping check as lean as possible:
$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1 -t 5
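For a rough idea of whether the 400 limit matters, here is a back-of-the-envelope estimate using the numbers from the tactical overview. It is a sketch based on Little's law (average number in flight = start rate × average duration) and assumes checks are started at a steady rate; the function name is hypothetical.

```python
def average_concurrency(total_checks: int,
                        check_interval_sec: float,
                        avg_execution_sec: float) -> float:
    """Average number of checks running at once (Little's law).

    Assumption: checks start at a steady rate of total_checks / check_interval_sec.
    """
    start_rate = total_checks / check_interval_sec  # checks started per second
    return start_rate * avg_execution_sec


if __name__ == "__main__":
    # 3345 services every 300 sec, average execution time 2.463 sec.
    print(f"{average_concurrency(3345, 300.0, 2.463):.1f}")  # prints 27.5
    # Pessimistic case: every check takes the maximum observed 15.46 sec.
    print(f"{average_concurrency(3345, 300.0, 15.46):.1f}")  # prints 172.4
```

Under these assumptions both figures stay below 400, so the concurrency cap alone would not obviously explain a latency of several hundred seconds.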

Also, I use “pexec” (an alternative to rsh) to check hosts.

Kindly let me know if you have any suggestions.

thank you!


#2

Hi,

From what I can see, I have one question:
are your hosts set to be actively checked? (Do they have a check_command defined and “active_checks_enabled” set to 1?)
If yes: is it really necessary?
Host checks are executed whenever a service goes down anyway; since your services are PINGs, there’s no point in also doing a PING on the host :slight_smile: so you could set “active_checks_enabled” to 0 on the hosts. That may save some tests and so, perhaps, reduce the latency a little.

Apart from the Nagios settings, there are several reasons you may have latency. Here are the ones I know:
-CPU overloaded: the more services you add, the more your CPU is loaded … at a certain point, the CPU can’t handle that many tests, and the Nagios latency goes up.
=> run “top” on your server to check your CPU load: look at the %idle and your 3 load average values.
Note: if the load values exceed the number of CPUs in your server (like a load of 3 when you have only 2 CPUs), that means your CPU is overloaded.
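The rule of thumb in the note can be checked with a few lines of Python instead of reading “top” by eye. This is just a sketch; the function names are made up, and it only applies the simple “load greater than CPU count” test described above.

```python
import os


def is_overloaded(load_average: float, cpu_count: int) -> bool:
    """Rule of thumb from the note above: load above the CPU count means overload."""
    return load_average > cpu_count


def report() -> str:
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one, five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    state = "overloaded" if is_overloaded(five, cpus) else "ok"
    return f"load {one:.2f} {five:.2f} {fifteen:.2f} on {cpus} CPUs: {state}"


if __name__ == "__main__":
    print(is_overloaded(3.0, 2))  # the example from the note: prints True
    print(report())               # live values from the machine it runs on
```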

-max process limit reached:
Some OSes (or all, I don’t know :)) limit the number of processes a user can run simultaneously. Maybe you can have a look at this setting for your OS (I really can’t help you further on this one, as I don’t know much on the subject).
Note: in order to do 1 check, Nagios has to:

  1. fork => counts as 1 process
  2. launch the PING plugin => that’s a second process
  3. have the PING plugin call the PING executable of your system => that’s a third process

So you need 3 processes just to do 1 PING … the count can increase pretty quickly and reach the limit.
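To put a number on that, here is the arithmetic as a small sketch. The 3 processes per check is the count from the list above; the 400 is the max concurrent service checks value from the first post. The names are made up for illustration.

```python
# Assumption from the list above: fork + plugin + system ping per check.
PROCESSES_PER_CHECK = 3


def processes_needed(concurrent_checks: int,
                     processes_per_check: int = PROCESSES_PER_CHECK) -> int:
    """Upper bound on processes alive when this many checks run at once."""
    return concurrent_checks * processes_per_check


if __name__ == "__main__":
    # Worst case if all 400 allowed concurrent checks are running.
    print(processes_needed(400))  # prints 1200
```

That is a worst case, not what the server needs on average, but it gives an idea of how far the count can climb.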

Hope this helps a little bit.


#3

Hello Loose,

Thanks a lot for your prompt response.
i) I don’t have active_checks_enabled under hosts.cfg, but I do have it under services.cfg; disabling it in services.cfg made Nagios stop checking any machines at all. Is this what you recommended?

ii) I have a powerful machine (8 CPUs) running Red Hat Linux 5, so CPU overhead is not an issue. The average load is 3.20 as of now, which is quite normal.

iii) I have set the process limit for user nagios in /etc/security/limits.conf: soft 200 and hard 400.
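One way to confirm which process limit is actually in effect is to read it from inside a process running as the nagios user. The sketch below uses Python's standard `resource` module; the helper names are made up, and the 400 × 3 figure is only a worst-case guess built from the 400 concurrent checks in the first post and the three processes per check counted in post #2.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) process limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def fits(required_processes: int, soft_limit: int) -> bool:
    """True if the estimated process count stays within the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured
    return required_processes <= soft_limit


if __name__ == "__main__":
    soft, hard = nproc_limits()
    # Hypothetical worst case: 400 concurrent checks x 3 processes each.
    required = 400 * 3
    print(f"soft={soft} hard={hard} required={required} ok={fits(required, soft)}")
```

Run it as the nagios user (for example through su or sudo), since the limits belong to whichever user starts the process.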

I have restarted Nagios and erased all logs to start with fresh data (from scratch).

I will update you on how things go.

Thanks once again!!


#4

Hello Loose,
Here’s a quick update:
After restarting the Nagios process and clearing the previous logs, it was stable for the first 15 hours, with a latency of merely 8-9 secs, which I believe is acceptable (given the number of hosts I monitor). But the latency then climbs steeply and reaches the 200-second mark.

I am not able to understand the reason behind it.
Here’s the tactical overview captured as of now:

Service Check Execution Time: 0.09 / 15.46 / 2.514 sec
Service Check Latency: 70.48 / 147.59 / 116.179 sec
Host Check Execution Time: 0.03 / 6.55 / 0.628 sec
Host Check Latency: 0.00 / 0.00 / 0.000 sec

Active Host / Service Checks: 3345 / 3345

Passive Host / Service Checks: 0 / 0

Any recommendations?

Thanks a lot for taking the time & pain to help us out :slight_smile:


#5

Nope :slight_smile:
Disabling active checks on your services results in Nagios doing nothing, and that’s normal ^^ (in other words, you’re telling Nagios not to run the service checks :)). (I meant to disable them on your hosts; but as they were already disabled…)

Anyway, your problem is quite weird: getting normal latency for 15 hours (8-9 sec is perfectly normal for such an installation) and then seeing it increase that much is not something I would have expected.
In the meantime, you could read nagios.sourceforge.net/docs/2_0/tuning.html ; this may (or may not) help you.

And to finish without really solving your problem directly (sorry :)), I had the “same” problem:
we had 4000 services on one server; although the CPU load was not too high (it wasn’t low either :)), we had a latency of around 12 min, and it increased with “each” test we added.
I made some graphs of latency against the number of services, and we could see that it was increasing exponentially … which is not a good thing.

In the end, we concluded that the latency comes from a combination of a software limitation and a hardware limitation:
the better your CPUs are, the more services you can add before the latency becomes unacceptable… but even with a server specced far above the Nagios requirements, we still hit a limit, probably due to the way Nagios is coded…

So, to solve the problem once and for all (or not…), we decided to split our server into 3 Nagios servers.
They run on the same machine as before, as the CPUs are capable of handling it.
We created 3 accounts (nagios, nagios2, nagios3), each with access to 2 CPUs. These 3 servers are totally independent (I advise you to do the same; I first tried to run all 3 from the same Nagios install => it took me a while to get it working, and the performance was really bad).
The result is very low latency on each server, a better division of work (like: nagios3 for network monitoring, nagios2 for databases, and nagios for the rest), and cross-monitoring (each Nagios supervises the others :)).
The only drawback is that, overall, we use more resources than before (more disk, more CPU, more RAM), but that’s a fair tradeoff in the end :slight_smile:

Sorry if this doesn’t directly answer your problem, but I thought that might give you some ideas :wink:

edit: OMG! there are too many smileys; sorry!