I am having an issue with Nagios 3.2.1 where active checks are being run on some of my passive service checks even though the freshness threshold has not been reached. I do not see any messages stating that the service went stale (i.e. "The results of service A on host B are stale by…"), so I'm led to believe that active checks are being run for this service for some other reason. Also, as you can see below, the external command and the passive check result arrive well within the threshold before the alert is raised.
If I grep for the hostname in the nagios.log file, I find this section…
 EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;lx250.austin.ibm.com;NONCRITPROCESS;0;LoadL_master LoadL_startd
 PASSIVE SERVICE CHECK: lx250.austin.ibm.com;NONCRITPROCESS;0;LoadL_master LoadL_startd
 SERVICE ALERT: lx250.austin.ibm.com;NONCRITPROCESS;UNKNOWN;HARD;1;UNKNOWN: monitorProcesses-notRunning
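For anyone who wants to measure the gap between a passive result and the following alert across many hosts, here is a minimal Python sketch. It assumes each nagios.log line starts with a bracketed Unix timestamp (the timestamps are not shown in the excerpt above), and the function names `parse_line` and `gaps_after_passive_result` are my own, not part of Nagios.

```python
import re

# Assumed nagios.log layout: "[<unix epoch seconds>] <ENTRY TYPE>: <field>;<field>;..."
LINE_RE = re.compile(r"^\[(\d+)\]\s+([A-Z ]+): (.*)$")


def parse_line(line):
    """Return (timestamp, entry_type, fields), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp, entry_type, payload = match.groups()
    return int(timestamp), entry_type, payload.split(";")


def gaps_after_passive_result(lines, host, service):
    """Yield (seconds, alert_state) for every SERVICE ALERT on host/service
    that follows a PASSIVE SERVICE CHECK entry for the same host/service."""
    last_passive = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not follow the assumed layout
        timestamp, entry_type, fields = parsed
        if fields[:2] != [host, service]:
            continue  # different host/service, or an EXTERNAL COMMAND entry
        if entry_type == "PASSIVE SERVICE CHECK":
            last_passive = timestamp
        elif entry_type == "SERVICE ALERT" and last_passive is not None:
            # fields[2] is the state reported in the alert (e.g. UNKNOWN)
            yield timestamp - last_passive, fields[2]
```

To use it, open the log file (its path depends on the installation) and pass the file object together with the host and service names, e.g. `list(gaps_after_passive_result(open(path), "lx250.austin.ibm.com", "NONCRITPROCESS"))`.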
Within 24 seconds of receiving an OK status for this service, it goes into an UNKNOWN hard state. Here is the config hierarchy for this service…
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
The other oddity is that we only see this intermittently. We monitor ~750 systems with this Nagios server in this manner, and at any given time only 1 or 2 seem to be in this state. Am I missing something obvious here? Has anyone else run into this before? The most important question is this: is there any reason other than freshness checking that Nagios would run an active check on a passive service? I really appreciate any insight anyone might have on this problem.