Huge Check Latency


#1

Hi there,

I’m new here, but I searched and didn’t find any question like mine.

We’re using Nagios to monitor our server farm; it consists of around 1500 hosts with more than 5000 different checks (NRPE and SNMP as well).

The problem we have is the Check Latency: on average it’s more than 150-200 seconds :(.
The number of concurrent checks is set to “0” (unlimited), and:

host_inter_check_delay_method=s
service_inter_check_delay_method=s (I tried fixed numbers from 1 sec to 0.1 sec with no big difference)
service_reaper_frequency=2
max_host_check_spread=15
max_service_check_spread=15

nagios -s nagios.cfg shows:

HOST SCHEDULING INFORMATION

Total hosts: 836
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 15 min
First scheduled check: N/A
Last scheduled check: N/A

SERVICE SCHEDULING INFORMATION

Total services: 5451
Total scheduled services: 5447
Service inter-check delay method: SMART
Average service check interval: 775.09 sec
Inter-check delay: 0.14 sec
Interleave factor method: SMART
Average services per host: 6.52
Service interleave factor: 7
Max service check spread: 15 min
First scheduled check: Mon Nov 14 11:19:59 2005
Last scheduled check: Mon Nov 14 11:32:54 2005

CHECK PROCESSING INFORMATION

Service check reaper interval: 2 sec
Max concurrent service checks: Unlimited

PERFORMANCE SUGGESTIONS

I have no suggestions - things look okay.

If I reduce the number of checks to 4000 (I tried removing different service groups), the latency drops to 20-30 seconds.

Where could the problem be, and where should I look?

The CPU usage on our dual-Xeon (2.4 GHz) Nagios server is not very high.


#2

You should look at each and every service check and make sure it’s the best you can do. For example:
- Do I use check_ping, or the quicker check_fping?
- Do I use -n 2, or the slower -n 5, for check_fping?
- Did I specify -m for all check_snmp checks, or am I making the CPU search for the MIB needed for the SNMP check?
- Am I checking things that have no real importance, like “interface duplex setting”? Really, how often does that change? Once it’s set, it never changes. In other words, remove checks that are just fluff.

There are other tips in the docs:
nagios.sourceforge.net/docs/2_0/tuning.html
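As a sketch, the two plugin tweaks above might look like this as Nagios command definitions. The command names, community string, and OID here are invented for illustration; check_fping’s -n (number of packets) and check_snmp’s -m (MIBs to load) are real plugin options:

```
# check_fping with 2 packets instead of the default 5
define command{
    command_name    check_host_fping
    command_line    $USER1$/check_fping $HOSTADDRESS$ -n 2
    }

# check_snmp told exactly which MIB to load, instead of searching all of them
define command{
    command_name    check_snmp_sysuptime
    command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C public -o sysUpTime.0 -m SNMPv2-MIB
    }
```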

I find that I don’t want to waste time retrying a check. If a certain check fails, there are some that I know I don’t want to check 5 more times, so look at your checks and trim down the retries.

If you have high latency at this time, is it due to having failed services?

For example, if you unplug your network cable from the Nagios box, then all checks will fail. It won’t take long before you have huge latency. The reason: the retries. So don’t leave any checks in a failed state. If you do, then at least turn off active checks for that service.

If you have to, run your checks by hand. Take a look at the ones that are slow to execute. Why are they slow?


#3

My check latency is .14 seconds at this time. That means that, on average, a check runs late from its normal scheduled execution time by a max of .14 seconds. In your case, your checks are late over 3 minutes on average. Those settings you quoted are for the v2.x beta and I’m not familiar with them. Double-check those settings and make sure they are correct, e.g. max_service_check_spread, but it looks like that is just for the initial startup.

If your CPU is not topped out, I suspect that each service check is taking too long. What is your Check Execution Time min/max/average as reported in nagios/cgi-bin/extinfo.cgi?&type=4 (performance info)? Mine is min < 1 sec, max < 2 sec, avg 0.387 seconds. 0.387 x 575 checks = 222.5 seconds for all 575 checks. My normal check interval is 300 seconds, so that leaves me with about 80 seconds to spare. Now how does yours calculate out? If, for example, you have the same avg as me, then it would be:
0.387 sec/check x 5447 checks = 2107.989 seconds. That’s about 35 minutes. I’d suggest backing off your normal service check interval. Add 1 minute to it until you get a latency that is acceptable.
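That back-of-the-envelope math can be sketched in a couple of lines (using the 0.387 s average from my own box, applied to your 5447 checks purely for illustration):

```shell
# Capacity check: how long would one serialized pass over all checks take?
avg=0.387; checks=5447; interval=300
total=$(awk "BEGIN { printf \"%.3f\", $avg * $checks }")
echo "Time to run every check once, serialized: $total s"
# That is far more than the 300 s check interval, so without heavy
# concurrency (or a longer interval) latency can only grow.
```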

Also, I noticed that you use check_nrpe. If I’m not mistaken, that is an “active check”, and that is a big mistake. To get the most bang for the buck with Nagios, all checks on the central server should be 100% passive checks. So you may need to set up a distributed Nagios system. You could have 2 or 3 other Nagios setups all performing the active checks, like NRPE, and each of those would then send their data to the central Nagios host, which is 100% passive checks. In other words, the central Nagios server is not making ANY CHECKS whatsoever. It’s simply displaying the results that the other setups have already taken. With this distributed setup, you may actually be able to get your latency down quite low.
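For reference, a passive result reaches the central server as a single line written to its external command file. This sketch only builds and prints such a line; the host and service names are invented, and the command-file path is the default one quoted later in this thread:

```shell
# Passive service result, in Nagios external-command format:
# [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output
now=$(date +%s)
line="[$now] PROCESS_SERVICE_CHECK_RESULT;webserver01;PING;0;PING OK - RTA 0.52 ms"
echo "$line"
# A distributed node (or NSCA) would append it to the command file instead:
#   echo "$line" >> /usr/local/nagios/var/rw/nagios.cmd
```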


#4

What the last post does for you is this:
It allows you to see on one website, the results of all of your checks. Otherwise, you would have to browse to many different websites, since your checks will be split up between 2 or 3 machines.


#5

[quote=“jakkedup”]What the last post does for you is this:
It allows you to see on one website, the results of all of your checks. Otherwise, you would have to browse to many different websites, since your checks will be split up between 2 or 3 machines. [/quote]

Yes, you’re right, right now we use the “single host” installation: the same host performs the active checks and presents the results. But truthfully, I don’t see how a distributed installation helps in my case. I have to run all the checks somewhere anyway, and if I simply use a second server to execute all active checks and present the results to the “frontend” server, I’m sure I’ll get the same latency.
Of course, if I could distribute all active checks over 2 or 3 other systems and present the results on the frontend system, it would help, but is it really necessary?

As I understand it, I should look carefully at the performance of my check scripts and at the retries, but the current situation is the following (min / max / average):

Check Execution Time: 0.00 sec / 50.08 sec / 0.699 sec
Check Latency: 0.00 sec / 333.08 sec / 309.609 sec
Percent State Change: 0.00% / 53.09% / 0.50%

I mean, on average the execution time is less than a second; there are some 20 or 30 checks that take more than 1 second, but that’s a very small percentage of 5500 checks.

And I constantly have around 100 problem checks that are already in a “hard” non-OK status; they should not influence the latency anyway, since they are scheduled according to their normal check interval.


#6

Forgot to say:
thanks, jakkedup, for the comments and info :slight_smile:


#7

"But truthfully, I don’t see how a distributed installation helps in my case"
It will help, since the “Central Nagios” server will be receiving 5000 passive check results. In other words, it really won’t perform any checks at all. It will simply be reading the external command file over and over. Passive checks are processed faster than active checks since the output is already there; Nagios only has to parse it, etc. It won’t have to actually make the check to get the output.


#8

"I constantly have around 100 problem checks that are already in a hard non-OK status; they should not influence the latency anyway"
Yes, they will influence your latency if you still have active checks enabled for those services.
Either way, all services will be put in the schedule. If the check is enabled, we already know it will fail. Since it failed, Nagios then has to look at the retry value. A retry value of 3 means it will perform that check 3 more times, wasting precious time on something we already know is dead. So, if you know it’s dead, acknowledge the problem so notifications are suppressed (if you have notifications on), and disable active checks for all failed services/hosts. That way they will still be scheduled, yes, but the check will never be performed, and so no time is wasted.
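Both of those actions can also be scripted through the external command file. ACKNOWLEDGE_SVC_PROBLEM and DISABLE_SVC_CHECK are the documented external commands; the host, service, and author names below are placeholders, and this sketch only prints the lines:

```shell
# Acknowledge a known-dead service and stop actively checking it.
now=$(date +%s)
# ACKNOWLEDGE_SVC_PROBLEM;host;service;sticky;notify;persistent;author;comment
ack="[$now] ACKNOWLEDGE_SVC_PROBLEM;db01;Disk Space;1;1;1;admin;known dead, parts on order"
dis="[$now] DISABLE_SVC_CHECK;db01;Disk Space"
printf '%s\n%s\n' "$ack" "$dis"
# A real setup would append both lines to the command file instead:
#   printf '%s\n%s\n' "$ack" "$dis" >> /usr/local/nagios/var/rw/nagios.cmd
```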


#9

nagios.sourceforge.net/docs/2_0/tuning.html
There are some good tips on this page.
One that you could easily do, if you have enough RAM, is putting the status.log file on a ramdisk.
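A minimal sketch of that ramdisk trick, assuming the default Nagios paths seen elsewhere in this thread (the mount point and the 16 MB size are arbitrary choices):

```
# /etc/fstab: a small tmpfs owned by the nagios user
tmpfs  /usr/local/nagios/var/ramdisk  tmpfs  size=16m,uid=nagios,gid=nagios  0 0

# nagios.cfg: point the status log at the ramdisk
status_file=/usr/local/nagios/var/ramdisk/status.log
```

The status file is rewritten constantly, so moving it off disk removes a steady stream of writes; the file is disposable state, so losing it on reboot is harmless.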

This is from the docs:
“Use passive checks when possible. The overhead needed to process the results of passive service checks is much lower than that of “normal” active checks, so make use of that piece of info if you’re monitoring a slew of services.”

So you see, the docs say that passive checks are a BIG help.

Another tip from the docs concerns your check_ping (they describe a host check, but do it for your ping service check too). I did this trick and it really does help; I set mine to 2 packets instead of the default of 5.
“Optimize host check commands. If you’re checking host states using the check_ping plugin you’ll find that host checks will be performed much faster if you break up the checks. Instead of specifying a max_attempts value of 1 in the host definition and having the check_ping plugin send 10 ICMP packets to the host, it would be much faster to set the max_attempts value to 10 and only send out 1 ICMP packet each time.”

Do NOT schedule regular checks of hosts. Just check services, and if a service fails, then it will check the host.
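As a sketch, the quoted check_ping trick might look like this in the config. The command name, host name, address, and thresholds are placeholders; -p is check_ping’s real packets option:

```
# Host check command that sends a single ICMP packet per attempt
define command{
    command_name    check-host-alive-1pkt
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
    }

# Host definition: retry up to 10 times instead of sending 10 packets at once
define host{
    host_name           webserver01
    address             192.0.2.10
    check_command       check-host-alive-1pkt
    max_check_attempts  10
    }
```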

[Edited Mon Nov 14 2005, 07:46AM]


#10

I’d like to say that I’m amazed you don’t have a higher latency than you do, with that many services. I think if you tweak your ping check to send 2 packets, disable checks for services that you know are down and don’t plan to fix TODAY, and perhaps a few other little things, you might get latency down to approximately 0.


#11

:slight_smile: I had a latency of 2000 seconds, but after some tuning I got it back down to 100-200.

I’ll try to follow your recommendations.


#12

[quote=“jakkedup”]"But truthfully, I don’t see how a distributed installation helps in my case"
It will help, since the “Central Nagios” server will be receiving 5000 passive check results. In other words, it really won’t perform any checks at all. It will simply be reading the external command file over and over. Passive checks are processed faster than active checks since the output is already there; Nagios only has to parse it, etc. It won’t have to actually make the check to get the output.

[/quote]

Yes, I see, but where will all those 5000 checks be done?
If they’re all done by one other Nagios server, I’m sure I’ll get the same latency, because the “active” Nagios will accumulate the latency while performing the checks, and it will notify the “Central Nagios” with updates in which the latency already exists.

Otherwise, I don’t really understand how the passive checks work :frowning:


#13

I’ll attempt to explain passive vs. active.
You could either take your car in to get the timing checked out and tuned up, or you could simply take the RESULTS of a timing check to the dealer and have them tune it up. Or, go to the doctor to have them find out why you are sick and treat you, or go to the doctor with the RESULTS of what is wrong with you and get treatment. It’s much faster to treat you if they already know what is wrong with you.

The central server will be set up with every check that you need, but they will be passive checks only. These passive checks are not made by the central server; they will be made by the 4 or 5 other Nagios machines that you have set up to perform “active” checks. Those 4 or 5 other machines will then pass the data to the central Nagios server as passive check results. So the 5000 service checks will be done by the 4 or 5 machines you have set up to perform active checks. These other machines will not have the check latency that you have now, since each machine will only have 1000 or so checks to make, rather than 5000. The central server should be able to handle all the processing of these passive checks, since it won’t have anything else to do anyway. Remember, the central Nagios server won’t be doing any active checks at all.
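In a typical Nagios 2.x distributed setup, each active node forwards every result to the central server via an OCSP command that pipes into send_nsca. This sketch only builds and prints the tab-delimited line send_nsca expects; the host/service names and paths are placeholders:

```shell
# send_nsca input format: host<TAB>service<TAB>return_code<TAB>plugin_output
host="webserver01"; svc="PING"; rc=0; out="PING OK - RTA 0.52 ms"
line=$(printf '%s\t%s\t%s\t%s' "$host" "$svc" "$rc" "$out")
echo "$line"
# A real OCSP script on a distributed node would pipe it to the central server:
#   echo "$line" | /usr/local/nagios/bin/send_nsca -H central-nagios \
#       -c /usr/local/nagios/etc/send_nsca.cfg
```

The NSCA daemon on the central server then writes each received result into the external command file, where the passive-only Nagios picks it up.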


#14

I think the misunderstanding here is that Ciberkot is thinking of using only one server for the active checks plus one “Central Nagios” server (hence only moving the problem), while Jakkedup is talking about using more than one server (e.g. 4 or 5) to perform the active checks (hence balancing the load of the active checks across more servers).
Correct?

By the way, I have the same problem that Ciberkot describes. I’m still in the process of installing Nagios, so I don’t have a lot of checks yet (about 21 service checks and 171 hosts to be checked).
The CPU and memory usage of my server (which is a pretty fast, brand-new machine) are very low, so that’s not the bottleneck.

The following data is displayed in my performance cgi:

------------------ Active host checks ------------------
Check Execution Time: 0.01 sec - 3.02 sec - 0.470 sec
Check Latency: 0.00 sec - 220.00 sec - 213.627 sec
Percent State Change: 0.00% - 0.00% - 0.00%

------------------ Active service checks ------------------
Check Execution Time: 0.01 sec - 2.55 sec - 0.421 sec
Check Latency: 218.46 sec - 220.99 sec - 220.211 sec
Percent State Change: 0.00% - 0.00% - 0.00%

------------------ Some settings from nagios.cfg ------------------
aggregate_status_updates=1
status_update_interval=10
nagios_user=nagios
nagios_group=nagios
enable_notifications=1
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=
accept_passive_host_checks=
enable_event_handlers=1
log_rotation_method=d
log_archive_path=/usr/local/nagios/var/archives
check_external_commands=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
downtime_file=/usr/local/nagios/var/downtime.log
comment_file=/usr/local/nagios/var/comment.log
lock_file=/usr/local/nagios/var/nagios.lock
retain_state_information=1
state_retention_file=/usr/local/nagios/var/status.sav
retention_update_interval=60
use_retained_program_state=1
use_syslog=0
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_checks=1
global_host_event_handler=
global_service_event_handler=
sleep_time=1
service_inter_check_delay_method=s
max_service_check_spread=5
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=10
host_inter_check_delay_method=
max_host_check_spread=
interval_length=10
auto_reschedule_checks=
auto_rescheduling_interval=
auto_rescheduling_window=
use_agressive_host_checking=0
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
soft_state_dependencies=0
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
ochp_timeout=
perfdata_timeout=30
obsess_over_services=0
ocsp_command=
obsess_over_hosts=
ochp_command=
process_performance_data=1
host_perfdata_command=cmdo_nagiosgraph_host
service_perfdata_command=
host_perfdata_file=
service_perfdata_file=
host_perfdata_file_template=
service_perfdata_file_template=
host_perfdata_file_mode=
service_perfdata_file_mode=
host_perfdata_file_processing_interval=
service_perfdata_file_processing_interval=
host_perfdata_file_processing_command=
service_perfdata_file_processing_command=
check_for_orphaned_services=1
check_service_freshness=1
freshness_check_interval=60

Do you perhaps see some similarities or differences in the configuration?

With regards,
Maarten
[Edited Wed Jan 11 2006, 03:26PM]


#15

Ok, I fixed my problem…

It turns out that in the service template I use the most, for some reason the “parallelizable” option was set to “0”, causing all my checks to run sequentially instead of in parallel.

Perhaps that is your problem too? It’s easy to find by just running: grep "par" /usr/local/nagios/etc/*.cfg
[Edited Wed Jan 11 2006, 06:26PM]
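For context, in the classic 1.x/2.x object syntax this option lives in the service definition; a minimal sketch (the directive is named parallelize_check in the old object docs, and the template/host/service names here are invented):

```
define service{
    use                  generic-service   ; hypothetical template
    host_name            webserver01
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
    parallelize_check    1   ; 0 forces this check to run serially
    }
```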


#16

interval_length=10? Why in the world would you do that?
You mean to tell me that you have a reason to check something every 10 seconds, instead of every 5 minutes, or perhaps every 1 minute?


#17

I’m having the same problem with huge service check latency. The CPU and memory are not bottlenecked. I’m checking 209 hosts and 548 services, with maybe 30% of the services having a check interval of 2 minutes; the rest are 5 or 10 minutes.

Metric: Min / Max / Average
Check Execution Time: 0.04 sec / 30.98 sec / 3.196 sec
Check Latency: 103.88 sec / 417.85 sec / 358.301 sec
Percent State Change: 0.00% / 51.05% / 0.39%

check_external_commands=1
command_check_interval=-1
event_broker_options=-1
use_syslog=1
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_checks=1
service_inter_check_delay_method=s
max_service_check_spread=15
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=10
max_concurrent_checks=0
service_reaper_frequency=2
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
sleep_time=0.25
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=0
interval_length=60
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=1
host_perfdata_command=process-host-perfdata
service_perfdata_command=process-service-perfdata
obsess_over_services=0
check_for_orphaned_services=0
check_service_freshness=0
service_freshness_check_interval=60
check_host_freshness=0
host_freshness_check_interval=60
aggregate_status_updates=1
status_update_interval=15
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=us

I tried putting the status file on a ramdisk as the tuning document recommends, with no improvement. There are no parallelization=0 directives in the service check definitions.

I understand how passive checks work and that we could migrate to a distributed architecture, but I’m puzzled why Nagios just won’t launch more processes with all the CPU and memory it has available and keep the service check latency reasonable.

I have yet to disable host checks, but that seems like a bandage and not a real solution, as we need to put more hosts and service checks on this system. Our previous meager IDE-based PIII 500 Nagios 1.x box with about 80% of the hosts and services performed better than our new Nagios 2.0 box, which is a 3 GHz Xeon.

Any thoughts? Thanks in advance.


#18

use_syslog=1 Why? Seems redundant.
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_external_commands=1
log_passive_checks=1
The last six: why? For debugging, yes, but in normal use I don’t see the need.
service_reaper_frequency=2 Why? 10 seconds should have been fine.
execute_host_checks=1 Do you really have a need for that?

Other than that, you must have a bad plugin that is causing your trouble, which is evident from the “30.98 sec” max check execution time. Find the culprit and fix it.


#19

I’ve gotten the service check latency down to 100 seconds by setting the host check interval on everything to 0, and I took sleep_time down to .1, which seemed to have the biggest impact. service_reaper_frequency at 2 was an attempt to squeeze out more performance.

We have some custom plugins that do multiple SNMP queries to enumerate alarms on some equipment, plus some of the network elements are on high-latency satellite links, which doesn’t help. Our average service check execution time isn’t outrageous, though.

I’ll try shutting off some logging to see if we get some gains; I appreciate your feedback.

I’ve seen several posts in other forums about the same issues we’re having. It seems like the Nagios scheduler is well written to conserve CPU power, but it won’t launch checks quickly enough to take advantage of hefty hardware. I’m looking at the source code to hopefully better understand how the internals work.

Thanks again


#20

Goody,

SERVICE CHECK REAPER FREQUENCY
“This is the frequency (in seconds!) that Nagios will process the results of services that have been checked.”
service_reaper_frequency=10

So changing it to 2 has done nothing but make things harder for Nagios. By changing it from 10 to 2, you have forced Nagios to stop making service checks every 2 seconds so it can process the results of the checks it has made in the last 2 seconds. Bad thing to do.

sleep_time=1
“This is the number of seconds that Nagios will sleep before checking to see if the next service or host check in the scheduling queue should be executed. Note that Nagios will only sleep after it ‘catches up’ with queued service checks that have fallen behind.”

Pay attention to what it says about sleep: it will only use this value after it has caught up with queued-up service checks. Since you are already fighting a battle with high latency, this value has done nothing. I think it was a coincidence that you saw a difference.

You have an average service check execution time of 3.196 sec.
You have 548 services, for a total of about 1751 seconds, or roughly 29 minutes. Since you have your check interval set to 5 minutes, it’s going to be impossible to accomplish all the checks in that time frame.

It’s imperative that you address the 3.196-second average check execution time. I have about the same number of active checks as you, but my average time per check is .384 sec.
519 active checks at .384 seconds per check is about 199 seconds, or 3.3 minutes. I have a 5-minute check interval, so I have plenty of time to do other things like process the results of checks, look at the external command file, etc. I also have time to process 653 passive checks.
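As a rough sketch of that arithmetic, and of why concurrency matters, using the figures quoted above:

```shell
# Serial capacity vs. required concurrency for 548 checks at 3.196 s each,
# against a 300 s (5-minute) normal check interval.
checks=548; avg=3.196; interval=300
serial=$(awk "BEGIN { printf \"%.0f\", $checks * $avg }")
needed=$(awk "BEGIN { printf \"%.1f\", $checks * $avg / $interval }")
echo "One serialized pass: $serial s; avg concurrent checks needed: $needed"
# Even with max_concurrent_checks=0 (unlimited), the 3.196 s average
# execution time itself has to come down for latency to reach ~0.
```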