nagios-2.6-1
I have one main server running passive checks, and two distributed servers running active checks (a couple of thousand each) and reporting to the main one using nsca.
However, service updates from the distributed servers are slow: looking at Last Check, I often see results that are an hour old.
On the main server, I have active checks off (there are some services that do have them on, but not the ones checked from the distributed boxes). In nagios.cfg: obsess_over_services=0, check_for_orphaned_services=0, check_service_freshness=1
On the distributed server I have in nagios.cfg: obsess_over_services=1, check_for_orphaned_services=0, check_service_freshness=0, enable_flap_detection=0
and for services:
I am looking at the latency for service checks and host checks. It looks huge on the distributed server, even though it is a powerful box and Nagios is the only resource-intensive application running there.
for services it is
Since the host checks have no latency, they don't look to be the problem; only services are affected.
I am still not sure about the cause. Any help will be appreciated.
I have set max_check_attempts to 2 for a default host, and made sure that the number of pings is 1 in check-host-alive.
I set normal_check_interval on the distributed servers to 5 instead of 3.
There are no problems Nagios has to deal with. CPU utilization is no more than 10%.
My average check execution time is 1.3 sec, and there are 3000 services to check (some are custom bash scripts). If the checks did not run concurrently, one pass would take 65 minutes (3000 x 1.3 s = 3900 s). But the way I understand it, checks are executed in parallel (max_concurrent_checks=0 in nagios.cfg, i.e. no limit), so they should be able to finish within 5 minutes if there are enough resources.
Do I understand this right? Is there something else I could look at in the configuration, or is optimizing the checks the only option at this point?
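As a rough check of that arithmetic, here is a minimal sketch using only the numbers from the post (3000 services, 1.3 s average execution time, 5-minute interval). The variable names are made up for illustration, and it only computes averages; it says nothing about how Nagios actually schedules the checks.

```shell
#!/bin/sh
# Numbers taken from the post above; adjust them to your own setup.
services=3000        # service checks on one distributed server
avg_exec_ms=1300     # average check execution time, in milliseconds
interval_s=300       # normal_check_interval of 5 minutes, in seconds

# Total execution time if the checks ran strictly one after another.
serial_s=$(( services * avg_exec_ms / 1000 ))
serial_min=$(( serial_s / 60 ))

# Average number of checks that must be running at the same time
# for one full pass to fit into a single interval (rounded up).
needed=$(( (serial_s + interval_s - 1) / interval_s ))

# Average number of check results produced per second at that rate.
per_sec=$(( services / interval_s ))

echo "serial pass: ${serial_min} min"
echo "concurrent checks needed on average: ${needed}"
echo "results per second: ${per_sec}"
```

So on average only about a dozen checks need to be in flight at once, which fits with CPU being mostly idle. It also means roughly ten results per second have to be processed, and with obsess_over_services=1 each of those results triggers one OCSP command as well.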
Any suggestions, recommendations? Anyone?
After the changes I described earlier, I got a slight improvement.
<= 1 minute: 32 (1.0%)
<= 5 minutes: 216 (7.0%)
<= 15 minutes: 793 (25.5%)
<= 1 hour: 3107 (100.0%)
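The counts above are cumulative, so it can help to split them into separate age buckets. A small sketch that does this with the four numbers from the post (variable names are illustrative only):

```shell
#!/bin/sh
# Cumulative counts from the post, in order:
# "<= 1 minute", "<= 5 minutes", "<= 15 minutes", "<= 1 hour".
cumulative="32 216 793 3107"
total=3107

prev=0
per_bucket=""
for c in $cumulative; do
    # Services that fall into this bucket only (not the earlier ones).
    n=$(( c - prev ))
    # Share of the total in tenths of a percent, rounded to nearest,
    # using integer arithmetic only.
    tenths=$(( (n * 1000 + total / 2) / total ))
    per_bucket="$per_bucket $n"
    echo "$n services ($(( tenths / 10 )).$(( tenths % 10 ))%)"
    prev=$c
done
```

Read this way, 2314 of the 3107 services, about three quarters, were last checked between 15 minutes and 1 hour ago.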
I also found that submit_check_result times out sometimes. Not too often, every 30 minutes or so. Here is what I see in the logs:
[1172600830] Warning: OCSP command ‘/usr/local/lib/nagios/plugins/submit_check_result *** ‘check_ssh’ 0 ‘SSH OK - OpenSSH_3.6.1p2 (protocol 1.99)’’ for service ‘check_ssh’ on host ‘***’ timed out after 5 seconds
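To see how often this really happens, one option is to pull every such warning out of the Nagios log and print its timestamp in readable form. This is only a sketch: the function name ocsp_timeouts is made up, it assumes GNU date (for "-d @epoch") and log lines shaped like the one above, and the log path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Print the time (UTC) of every OCSP timeout warning read from standard input.
# Assumes lines of the form:
#   [epoch] Warning: OCSP command ... timed out after N seconds
ocsp_timeouts() {
    grep 'Warning: OCSP command' | grep 'timed out' |
    # Keep only the epoch timestamp between the leading brackets.
    sed -n 's/^\[\([0-9][0-9]*\)\].*/\1/p' |
    while read -r ts; do
        # GNU date: convert seconds since the epoch to a readable time.
        date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S UTC'
    done
}

# Usage (the path is a placeholder; use the log_file value from nagios.cfg):
#   ocsp_timeouts < /path/to/nagios.log
#   ocsp_timeouts < /path/to/nagios.log | wc -l    # number of timeouts
```

If the timeouts cluster at particular times rather than being spread evenly, that would be worth comparing against the load on the main server and the network at those moments.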
I had to modify the script provided in the Nagios documentation, but that is not the problem here. Here is what I use: