Service updates are slow on distributed server

paha · February 26, 2007, 8:33pm

nagios-2.6-1
I have one main server running passive checks, and two distributed servers running active checks (couple of thousand each) and reporting to the main one using nsca.

Service updates from the distributed servers are slow, however: looking at Last Check, I often see results that are an hour old.
On the main server, I have active checks off (a few services do have them on, but not the ones that are checked from the distributed boxes). In nagios.cfg: obsess_over_services=0, check_for_orphaned_services=0, check_service_freshness=1

On the distributed server I have in nagios.cfg: obsess_over_services=1, check_for_orphaned_services=0, check_service_freshness=0, enable_flap_detection=0
and for services:

active_checks_enabled           1
passive_checks_enabled          1
parallelize_check               1
obsess_over_service             1
check_freshness                 1
freshness_threshold             1200
notifications_enabled           0
event_handler_enabled           1
flap_detection_enabled          1
process_perf_data               0
retain_status_information       1
retain_nonstatus_information    1
register                        0
is_volatile                     0
max_check_attempts              3
normal_check_interval           3
retry_check_interval            1

I am not sure why the service status is being updated so slowly at this point. It was working fine before I set up the distributed configuration.

paha · February 26, 2007, 9:28pm

I am looking at the latency for service checks and host checks. It looks huge on the distributed server, even though it is a powerful box and Nagios is the only resource-intensive application running there.
For services it is:

Check Execution Time (min / max / average): 0.03 sec / 10.10 sec / 1.507 sec
Check Latency (min / max / average): 0.00 sec / 3211.31 sec / 1571.787 sec

For host checks:

Check Execution Time (min / max / average): 0.00 sec / 10.02 sec / 0.045 sec
Check Latency (min / max / average): 0.00 sec / 0.00 sec / 0.000 sec

<= 1 minute: 0 (0.0%)
<= 5 minutes: 149 (4.8%)
<= 15 minutes: 602 (19.4%)
<= 1 hour: 2505 (80.6%)

Since the host checks have no latency, they do not look to be the problem; only services are affected.
I am still not sure about the cause. Any help would be appreciated.

paha · February 26, 2007, 11:10pm

I have set max_check_attempts to 2 for the default host and made sure that the number of pings in check-host-alive is 1.
I also set normal_check_interval on the distributed servers to 5 instead of 3.

There are no problems Nagios has to deal with. CPU utilization is no more than 10%.
My average check execution time is 1.3 sec, with 3000 services to check (some are custom bash scripts). If they did not run concurrently, a full pass would take 65 min. But the way I understand it, the checks are executed in parallel (nagios.cfg: max_concurrent_checks=0), so they should be able to finish in 5 minutes if there are enough resources.
Do I understand this right? Is there something else I could look at in the configuration, or is the only option at this point to optimize the checks?
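
The arithmetic behind those numbers can be checked with a short shell sketch. The figures are the ones quoted above; the variable names are only illustrative, and the result is a rough lower bound on how many checks would need to be in flight at once, not a statement about how the Nagios scheduler actually behaves.

```shell
#!/bin/sh
# Figures quoted in the post; variable names are illustrative only.
services=3000        # number of service checks
avg_exec_ms=1300     # average execution time (1.3 s, kept in ms for integer math)
interval_s=300       # 5-minute normal_check_interval, in seconds

# Total time if every check ran one after another.
serial_s=$(( services * avg_exec_ms / 1000 ))
serial_min=$(( serial_s / 60 ))

# Minimum number of overlapping checks needed to fit one interval (rounded up).
min_parallel=$(( (serial_s + interval_s - 1) / interval_s ))

echo "serial run: ${serial_s} s (${serial_min} min)"
echo "checks in flight needed for a ${interval_s} s cycle: ${min_parallel}"
```

So on paper only about 13 checks need to overlap at any moment to cover 3000 services in 5 minutes, which suggests the limit is somewhere other than raw check execution time.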

paha · February 27, 2007, 6:46pm

Any suggestions or recommendations? Anyone?
After the changes I described earlier, I got a slight improvement:
<= 1 minute: 32 (1.0%)
<= 5 minutes: 216 (7.0%)
<= 15 minutes: 793 (25.5%)
<= 1 hour: 3107 (100.0%)

I also found that submit_check_result is timing out sometimes. It is not too often, every 30 minutes or so. Here is what I see in the logs:
[1172600830] Warning: OCSP command '/usr/local/lib/nagios/plugins/submit_check_result *** 'check_ssh' 0 'SSH OK - OpenSSH_3.6.1p2 (protocol 1.99)'' for service 'check_ssh' on host '***' timed out after 5 seconds
I had to modify the script provided in the Nagios documentation, but that is not the problem here. Here is what I use:

[code]#!/bin/bash

/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$3" "$4" | /usr/sbin/send_nsca nagios -c /etc/nagios/send_nsca.cfg[/code]
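
To make the line format easier to see, here is a small sketch that builds the same tab-separated record without sending it anywhere. The function name format_passive_result and the sample values are hypothetical, used only for illustration; the field order (host, service, return code, plugin output) is the one the printf above produces.

```shell
#!/bin/sh
# format_passive_result is a hypothetical helper name, not part of NSCA.
# It prints one tab-separated line: host, service, return code, plugin output.
format_passive_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Sample values are placeholders.
line=$(format_passive_result "host1" "check_ssh" 0 "SSH OK")

# Count the tab-separated fields to confirm the layout.
fields=$(printf '%s\n' "$line" | awk -F '\t' '{ print NF }')
echo "fields: $fields"
```

Running the helper by hand like this is a quick way to confirm that the record is well formed before it is piped to send_nsca.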

For now I changed aggregate_writes in nsca.cfg from 0 to 1. Let's see if it helps.
I am not expecting this change to make a significant improvement.

The latency is still huge. I would appreciate any help.