I am having an issue with Nagios 3.2.1 where active checks are being run on some of my passive service checks even though the freshness threshold has not been reached. I do not see any messages stating that the service went stale (i.e. "The results of service A on host B are stale by…"), so I’m led to believe that active checks are being run for this service for some other reason. Also, as you can see below, the external command and the passive check result arrive well within the freshness threshold at the time the alert gets flagged.
If I grep for the hostname in the nagios.log file, I find this section…
[1269874800] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;lx250.austin.ibm.com;NONCRITPROCESS;0;LoadL_master LoadL_startd
[1269874802] PASSIVE SERVICE CHECK: lx250.austin.ibm.com;NONCRITPROCESS;0;LoadL_master LoadL_startd
[1269874824] SERVICE ALERT: lx250.austin.ibm.com;NONCRITPROCESS;UNKNOWN;HARD;1;UNKNOWN: monitorProcesses-notRunning
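To make the timing in these entries concrete, here is a minimal Python sketch that parses the three log lines above and reports how many seconds after the first entry each one was logged. The function names (`parse_entry`, `gaps_from_first`) are made up for illustration and are not part of Nagios; the only assumption about the log format is the leading `[epoch] TYPE: message` layout visible in the excerpt.

```python
import re

# The three entries copied from the nagios.log excerpt above.
LOG_LINES = [
    "[1269874800] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;lx250.austin.ibm.com;NONCRITPROCESS;0;LoadL_master LoadL_startd",
    "[1269874802] PASSIVE SERVICE CHECK: lx250.austin.ibm.com;NONCRITPROCESS;0;LoadL_master LoadL_startd",
    "[1269874824] SERVICE ALERT: lx250.austin.ibm.com;NONCRITPROCESS;UNKNOWN;HARD;1;UNKNOWN: monitorProcesses-notRunning",
]

# freshness_threshold from the service definition, in seconds.
FRESHNESS_THRESHOLD = 2820

# "[<epoch seconds>] <entry type>: <rest of the line>"
ENTRY = re.compile(r"^\[(\d+)\] ([^:]+): (.*)$")


def parse_entry(line):
    """Split one log line into (epoch_seconds, entry_type, message)."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError("unrecognized log line: " + line)
    return int(match.group(1)), match.group(2), match.group(3)


def gaps_from_first(lines):
    """Return (entry_type, seconds since the first entry) for each line."""
    entries = [parse_entry(line) for line in lines]
    start = entries[0][0]
    return [(kind, ts - start) for ts, kind, _ in entries]


if __name__ == "__main__":
    for kind, gap in gaps_from_first(LOG_LINES):
        print(f"+{gap:>4}s  {kind}  (threshold {FRESHNESS_THRESHOLD}s)")
```

The same functions can be fed the full grep output for a host to see whether any alert ever follows a passive result by more than the threshold.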
Within 24 seconds of receiving an OK status for this service, it goes into an UNKNOWN hard state, far short of the 2820-second freshness_threshold set below. Here is the config hierarchy for this service…
define service{
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
define service{
name ll_passive_service_template
active_checks_enabled 0
passive_checks_enabled 1
flap_detection_enabled 0
check_period 24x7
max_check_attempts 1
normal_check_interval 5760
retry_check_interval 1
check_freshness 0
contact_groups ll_admins_warning
check_command check_dummy!0
notification_interval 60
notification_period 24x7
notification_options w,u,c,r
stalking_options w,c,u
use generic-service
register 0
}
define service{
hostgroup LL_DBATCH
service_description NONCRITPROCESS
contact_groups ll_batch_warning
use ll_passive_service_template
notification_interval 0
check_freshness 1
freshness_threshold 2820
check_command trigger_deadman_arg!monitorProcesses-notRunning
}
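Since the settings are spread over three definitions, here is a minimal Python sketch of how the `use` chain above resolves into the effective values for NONCRITPROCESS. It only models single inheritance where a locally set directive overrides the inherited one; it is not Nagios code, the `resolve` helper is hypothetical, and only the directives relevant to this problem are copied in.

```python
# Templates from the config above, reduced to the directives relevant here.
TEMPLATES = {
    "generic-service": {
        "active_checks_enabled": "1",
        "passive_checks_enabled": "1",
        "check_freshness": "0",
    },
    "ll_passive_service_template": {
        "use": "generic-service",
        "active_checks_enabled": "0",
        "passive_checks_enabled": "1",
        "max_check_attempts": "1",
        "check_freshness": "0",
        "check_command": "check_dummy!0",
    },
}

# The NONCRITPROCESS service definition itself.
SERVICE = {
    "use": "ll_passive_service_template",
    "service_description": "NONCRITPROCESS",
    "check_freshness": "1",
    "freshness_threshold": "2820",
    "check_command": "trigger_deadman_arg!monitorProcesses-notRunning",
}


def resolve(obj, templates):
    """Merge an object with its 'use' chain; local values win over inherited ones."""
    parent_name = obj.get("use")
    merged = resolve(templates[parent_name], templates) if parent_name else {}
    merged.update({key: value for key, value in obj.items() if key != "use"})
    return merged


if __name__ == "__main__":
    for key, value in sorted(resolve(SERVICE, TEMPLATES).items()):
        print(f"{key:28} {value}")
```

Resolved this way, the service ends up with active_checks_enabled 0, check_freshness 1, freshness_threshold 2820, max_check_attempts 1 and the trigger_deadman_arg check_command. The alert text in the log (monitorProcesses-notRunning) matches the argument of that check_command, and max_check_attempts 1 is consistent with the alert being HARD on the first attempt.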
The other oddity in this problem is that we only see it occasionally. We are monitoring ~750 systems with this Nagios server in this manner, and at any given time only 1 or 2 seem to be in this state. Am I missing something obvious here? Has anyone else run into this before? I guess the most important question is: is there any reason other than freshness checking that Nagios would run an active check on a passive service? I really appreciate any insight anyone might have on this problem.