Active Check run within passive freshness threshold


#1

I am having an issue with Nagios 3.2.1 where active checks are being run on some of my passive service checks even though the freshness threshold has not been reached. I do not see any messages stating that the service went stale (i.e. "The results of service A on host B are stale by…"), so I'm led to believe that active checks are being run for this service for some other reason. Also, as you can see below, the external command and the passive check arrive well within the freshness threshold at the time the alert is raised.

If I grep out the hostname in the nagios.log file, I find this section…
[1269874800] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;lx250.austin.ibm.com;NONCRITPROCESS;0;LoadL_master LoadL_startd
[1269874802] PASSIVE SERVICE CHECK: lx250.austin.ibm.com;NONCRITPROCESS;0;LoadL_master LoadL_startd
[1269874824] SERVICE ALERT: lx250.austin.ibm.com;NONCRITPROCESS;UNKNOWN;HARD;1;UNKNOWN: monitorProcesses-notRunning

Within 24 seconds of receiving an OK status for this service, it goes into an UNKNOWN hard state. Here is the config hierarchy for this service…

define service{
        name                            generic-service ; The 'name' of this service template
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        failure_prediction_enabled      1       ; Failure prediction is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        is_volatile                     0       ; The service is not volatile
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

define service{
        name                            ll_passive_service_template
        active_checks_enabled           0
        passive_checks_enabled          1
        flap_detection_enabled          0
        check_period                    24x7
        max_check_attempts              1
        normal_check_interval           5760
        retry_check_interval            1
        check_freshness                 0
        contact_groups                  ll_admins_warning
        check_command                   check_dummy!0
        notification_interval           60
        notification_period             24x7
        notification_options            w,u,c,r
        stalking_options                w,c,u
        use                             generic-service
        register                        0
        }

define service{
        hostgroup                       LL_DBATCH
        service_description             NONCRITPROCESS
        contact_groups                  ll_batch_warning
        use                             ll_passive_service_template
        notification_interval           0
        check_freshness                 1
        freshness_threshold             2820
        check_command                   trigger_deadman_arg!monitorProcesses-notRunning
        }

The other oddity in this problem is that we only see it periodically. We are monitoring ~750 systems with this Nagios server in this manner, and at any given time only 1 or 2 seem to be in this state. Am I missing something obvious here? Has anyone else run into this before? I guess the most important question is: is there any reason other than freshness checking that Nagios would run an active check on a passive service? I really appreciate any insight anyone might have on this problem.


#2

Update - I tried setting normal_check_interval to zero, thinking that active checks are somehow getting run even though I have active_checks_enabled set to 0. This did not seem to help. I did notice another difference. We have an older Nagios server running 3.0.3. If you go to an identical service on that system in the web interface, it says active checks are disabled and the box is red. However, on the 3.2.1 server, if I go to the service, it shows that active checks are enabled and the box is green. I then viewed the config via the web GUI, and in that section it says that active checks are set to "No" for the service in question. There seems to be some discrepancy within the GUI as to whether or not the service is actively checked.


#3

Not sure if it is the case… but please try the first one…

http://forums.meulie.net/t/the-faq-and-common-solutions-thread/5241


#4

Thank you for the links about shutting down Nagios and killing off processes. After I tried this, I noticed that there were several nsca processes running. Currently, I have nsca running under xinetd, so an nsca process only starts when an event comes in, processes it, and then exits. Well, for whatever reason, some of these processes were not exiting. So I set up a cron job to kill off nsca processes older than 5 minutes, and this appears to have fixed the problem. I am now investigating running nsca as a constantly running daemon rather than under xinetd. I think it might be more stable that way and possibly easier to maintain. Thank you again for pointing me in the right direction.


#5

Thanks to you for the update :)


#6

Was just reading over this old post of mine and wanted to give a definitive answer to my problem. The issue was the normal_check_interval directive not being set to 0. For whatever reason, even though I had active_checks_enabled set to 0, Nagios was still checking the service every 5760 minutes. Once the interval was set to 0, the problem went away. As for the NSCA processes, I have reconfigured our servers so that they run as daemons; however, I still have to kill off old processes with a cron job. Between those 2 fixes, things are running fairly smoothly.
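
For anyone landing here later, the change amounts to a one-line edit in the passive template from the first post. This is only a sketch of that template with the interval changed; the inline comments are mine, and the directives not shown are assumed to stay exactly as originally posted:

```
define service{
        name                            ll_passive_service_template
        use                             generic-service
        active_checks_enabled           0       ; passive-only service
        passive_checks_enabled          1
        normal_check_interval           0       ; was 5760; setting this to 0 is what made the problem go away here
        check_freshness                 0       ; still overridden to 1 in the NONCRITPROCESS service
        register                        0       ; template only
        # all other directives unchanged from the template in the first post
        }
```

The NONCRITPROCESS service itself keeps check_freshness 1 and freshness_threshold 2820, so the check_command should still be what runs when passive results actually go stale.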


#7

Struggling with similar issues, trying to set the directive to 0 on the service to see if it helps.

NSCA is not an issue for me :)