HUGE Nagios Check Latency



I am running Nagios 2.10, which monitors approximately 3000 hosts (1 service per host).
I get a huge check latency and have tried a bunch of things, none of which seem to work.
Here is the output of the Nagios preflight check:

Nagios 2.10
Last Modified: 10-21-2007
Projected scheduling information for host and service
checks is listed below. This information assumes that
you are going to start running Nagios with your current
config files.


Total hosts: 3345
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 30 min
First scheduled check: N/A
Last scheduled check: N/A


Total services: 3345
Total scheduled services: 3345
Service inter-check delay method: SMART
Average service check interval: 300.00 sec
Inter-check delay: 0.09 sec
Interleave factor method: SMART
Average services per host: 1.00
Service interleave factor: 1
Max service check spread: 30 min
First scheduled check: Mon Aug 18 10:38:13 2008
Last scheduled check: Mon Aug 18 10:43:12 2008


Service check reaper interval: 10 sec
Max concurrent service checks: 400


I have no suggestions - things look okay.

Here is the tactical overview:
Service Check Execution Time: 0.24 / 15.46 / 2.463 sec
Service Check Latency: 634.00 / 683.67 / 658.990 sec
Host Check Execution Time: 0.03 / 7.22 / 0.676 sec
Host Check Latency: 0.00 / 0.00 / 0.000 sec

Active Host / Service Checks: 3345 / 3345

Passive Host / Service Checks: 0 / 0

As you can see above, I have forced max concurrent service checks to 400 instead of the default “0” (unlimited). The service check latency is in the hundreds of seconds (and at times reaches the thousands).
I have kept the ping check as lean as possible:
$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1 -t 5
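
As a rough sanity check on the figures above, here is a small Python sketch. It assumes the checks are spread evenly over the 300-second interval and uses the average execution time from the tactical overview; the function names are made up for illustration and are not part of Nagios.

```python
def inter_check_delay(check_interval_s, total_services):
    """Delay between consecutive check starts if checks are spread evenly."""
    return check_interval_s / total_services


def average_concurrency(check_interval_s, total_services, avg_exec_time_s):
    """Average number of checks in flight: start rate times execution time."""
    checks_per_second = total_services / check_interval_s
    return checks_per_second * avg_exec_time_s


if __name__ == "__main__":
    # Figures taken from the preflight output and tactical overview above.
    delay = inter_check_delay(300.0, 3345)
    in_flight = average_concurrency(300.0, 3345, 2.463)
    print(f"inter-check delay: {delay:.2f} sec")
    print(f"average checks in flight: {in_flight:.1f}")
```

Under those assumptions the delay matches the 0.09 sec reported by the preflight check, and the average number of checks running at once comes out at roughly 27, well below the configured maximum of 400.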

Also, I use “pexec” (an alternative to rsh) to check hosts.

Kindly let me know if you have any suggestions.

Thank you!



From what I can see, I have one small question:
did you set your hosts to be checked? (Do they have a check_command defined and “active_checks_enabled” set to 1?)
If yes: is it really necessary?
Host checks are executed whenever a service goes down; as your services are PINGs, there’s no point doing a separate PING on the host :slight_smile: so you could set “active_checks_enabled” to 0 on the hosts. That may save some tests and so may reduce the latency a little.

Apart from the Nagios settings, there are several reasons you may have latency. Here are the ones I know:
-CPU overloaded: the more services you add, the more your CPU will be loaded … at a certain point, the CPU won’t be able to handle that many tests, and the Nagios latency will increase.
=> run “top” on your server to check your CPU load: look at the %idle value and the 3 load averages.
Note: if the load values exceed the number of CPUs of your server (like a load of 3 when you have only 2 CPUs), that means your CPU is overloaded.
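
If you prefer a script over reading “top”, the rule of thumb above can be sketched in Python like this. It is only an illustration: `is_overloaded` is a made-up helper name, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def is_overloaded(load_average, cpu_count):
    """Rule of thumb from the note above: load higher than the CPU count."""
    return load_average > cpu_count


if __name__ == "__main__":
    # The 1, 5 and 15 minute load averages (the 3 load values shown by top).
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    for label, load in (("1 min", one_min),
                        ("5 min", five_min),
                        ("15 min", fifteen_min)):
        status = "overloaded" if is_overloaded(load, cpus) else "ok"
        print(f"{label}: load {load:.2f} on {cpus} CPUs -> {status}")
```

This only looks at the load averages; it does not replace checking %idle in top.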

-max process limit reached:
some OSes (or all, I don’t know :)) have a limit on the number of processes a user can run simultaneously. Maybe you can have a look at this setting for your OS (I really can’t help you further on this one, as I don’t know much on the subject).
Note: in order to do 1 check, Nagios has to:

  1. fork => counts as 1 process
  2. launch the PING plugin => that’s a second process
  3. have the PING plugin call the PING executable of your system => that’s a third process
    So you need 3 processes just to do 1 PING … the count can increase pretty quickly and reach the limit.
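
The counting above can be sketched in Python as follows. This is only an estimate under the assumption that every check really uses 3 processes; the helper names are made up, and `resource.RLIMIT_NPROC` is only available on systems that support it (Linux does). The limit reported is the one for the user running the script, so run it as the nagios user.

```python
import resource

PROCESSES_PER_CHECK = 3  # fork + plugin + system ping, as listed above


def processes_needed(concurrent_checks, per_check=PROCESSES_PER_CHECK):
    """Worst-case number of processes for a given number of parallel checks."""
    return concurrent_checks * per_check


def fits_within_limit(concurrent_checks, soft_limit):
    """True if the estimated process count stays within the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no per-user limit configured
    return processes_needed(concurrent_checks) <= soft_limit


if __name__ == "__main__":
    # Per-user process limit of the user running this script.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    checks = 400  # example value: the max concurrent checks from the first post
    print(f"soft limit: {soft}, hard limit: {hard}")
    print(f"{checks} parallel checks need up to {processes_needed(checks)} processes")
    print("fits within soft limit:", fits_within_limit(checks, soft))
```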

Hope this helps a little bit.


Hello Loose,

Thanks a lot for your prompt response.
i) I don’t have active_checks_enabled in hosts.cfg but I do have it in services.cfg; disabling it in services.cfg made Nagios not check any machines at all. Is this what you recommended?

ii) I have a powerful machine (8 CPUs) running Red Hat Linux 5, so CPU overhead is not an issue. The average load is 3.20 as of now, which is quite normal.

iii) I have set the process limit for user nagios in /etc/security/limits.conf: soft 200 & hard 400.

I have restarted Nagios & erased all logs to have fresh data (from scratch).

I will update you on how things go.

Thanks once again!!


Hello Loose,
Here’s a quick update:
After restarting the Nagios process and clearing the previous logs, it seemed stable for the first 15 hours, with a latency of merely 8-9 secs, which I believe is acceptable (given the number of hosts I monitor). But the latency then increases exponentially and reaches the 200-second mark.

I am not able to understand the reason behind it.
Here’s the tactical overview captured as of now:

Service Check Execution Time: 0.09 / 15.46 / 2.514 sec
Service Check Latency: 70.48 / 147.59 / 116.179 sec
Host Check Execution Time: 0.03 / 6.55 / 0.628 sec
Host Check Latency: 0.00 / 0.00 / 0.000 sec

Active Host / Service Checks: 3345 / 3345

Passive Host / Service Checks: 0 / 0

Any recommendations?

Thanks a lot for taking the time & pain to help us out :slight_smile:


nop :slight_smile:
Disabling active checks on your services results in Nagios doing nothing, and that’s normal ^^ (to be clear, you’re telling Nagios that it shouldn’t do the service checks :)). (I said to disable them on your hosts; but as they were already disabled…).

Anyway, your problem is quite weird; getting normal latency for 15 hours (8-9 sec is perfectly normal for such an installation) and then seeing the latency increase that much is not something I would have expected.
In the meantime, you could read ; this may (or probably not) help you.

And to finish without really solving your problem directly (sorry :)), I had the “same” problem:
we had 4000 services on one server; although the load on the CPUs was not too high (it wasn’t low either :)), we had a latency of around 12 min, and it increased with “each” test we added.
I made some graphs of latency against the number of services, and we could see that it increases exponentially … which is not a good thing.

In the end, we concluded that the latency is a combination of a software limitation with a hardware limitation:
the better your CPUs are, the further the “acceptable limit” for latency is… but even with a server that was set up far above the requirements of nagios, we still observed a limitation, probably due to the way nagios is coded…

So, to solve the problem once and for all (or not…), we decided to split our server into 3 servers.
They are on the same machine as before, as the CPUs are capable of handling it.
We created 3 accounts (nagios, nagios2, nagios3), each having access to 2 CPUs. These 3 servers are totally independent (I advise you to do the same; I tried to run the 3 from the same Nagios install => it took me a while to get it working, and the performance was really bad).
The result is a very low latency on each server, a better distribution of work (like: nagios3 for network monitoring; nagios2 for databases and nagios for the rest), and cross-monitoring (each Nagios instance supervises the others :)).
The only drawback is that, overall, we use more resources than before (more disk, more CPU, more RAM), but that’s a fair tradeoff in the end :slight_smile:

Sorry if this doesn’t directly answer your question, but I thought it might give you some ideas :wink:

edit: OMG! there are too many smileys; sorry!