We currently run Nagios without state retention, so that every time it starts it requeues all service checks. For a number of reasons, this is the setup we want to keep for now.
Every time we restart Nagios, all of the service checks in the web GUI are set to “pending”. Over time, each one is checked and gets a proper status. However, after every restart we see some service checks that are listed as scheduled for a certain time, but that time is in the past, as if Nagios failed to execute the check when it was scheduled. The result is that Nagios never starts performing checks on that service. The only way to get it to start checking the service is to queue a check manually. From that point on it rechecks the service properly.
Has this ever happened to anyone else? Why do you suppose Nagios has the service check scheduled but never performs it?
Are you sure you have execute_service_checks=1 set? This option tells Nagios whether to execute service checks when it restarts. If you do, then I’d check your nagios.log file. That’s usually one of the best ways to locate errors.
Yes, execute_service_checks=1. And 95% of our service checks execute properly. The problem, again, is that when Nagios starts up and initially queues services for their first check after the restart, it seems to fail to execute certain (random) scheduled service checks. As a result, it never reschedules those service checks for another check, and they sit on the host view stating that they will be executed at a certain date in the past.
The question is why Nagios would fail to even try to execute these few service checks (it’s not failing AT executing, it’s failing TO execute, meaning that it doesn’t appear to be calling the service check at all).
The resolution came in two parts. First, I lowered service_reaper_frequency from 10 to 1 in nagios.cfg. Perhaps this gives Nagios an easier time processing check results in small batches instead of a lot at once; I’m not really sure. In any case, doing this seemed to keep service checks from falling off the queue.
The other change was to enable check_for_orphaned_services in nagios.cfg. This is kind of the backup plan: it appears that service checks that were supposed to run, but never ran or never returned results, are immediately requeued.
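For anyone who wants to try the same thing, the two changes would look roughly like this in nagios.cfg. This is only a sketch of the relevant lines, not a full config. The comments are my own reading of what each option does, and I believe newer Nagios releases call the reaper option check_result_reaper_frequency, so check the sample config shipped with your version.

```cfg
# nagios.cfg (fragment, only the lines discussed in this thread)

# Service checks must be enabled for anything to run at all.
execute_service_checks=1

# How often (in seconds) Nagios processes the results of finished
# service checks. Lowered from 10 to 1.
service_reaper_frequency=1

# Requeue service checks that were scheduled but never returned
# a result.
check_for_orphaned_services=1
```

Nagios has to be restarted after editing nagios.cfg for the new values to take effect.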
Since implementing these changes our service check schedule has been keeping up to date.