I know nagios’ config contains an inter_check_delay_method value, which controls how nagios spaces out the initial service checks when it starts up. We’re currently using “smart”.
Also of note, we’ve currently got retain_state_information set to 0, because of the annoyance of service check definitions not actually changing when the system is reloaded. (We use a custom in-house nagios configuration manager).
We’ve got about 2500 service checks set up in nagios. When we restart the system it tends to take a while for all those services to get checked initially. The annoying part is that, with the “smart” initial distribution, nagios can take several minutes after a restart to perform the first check of a service that normally gets checked every minute.
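For anyone wanting to put rough numbers on this: as far as I understand the nagios documentation on service check scheduling, the “smart” method sets the inter-check delay to the average check interval of all services divided by the total number of services. Here is a minimal sketch of that arithmetic. The 300-second average interval is purely an assumption for illustration, the function names are made up, and the sketch ignores interleaving and any cap on the check spread.

```python
def smart_inter_check_delay(total_services, avg_check_interval_s):
    """Delay between consecutive initial checks under the "smart" method,
    assuming delay = average check interval / total number of services."""
    return avg_check_interval_s / total_services


def first_check_offsets(total_services, delay_s):
    """Seconds after startup at which each service gets its first check,
    assuming checks are simply queued one after another."""
    return [i * delay_s for i in range(total_services)]


# 2500 services is from the post; the 300 s average interval is an assumption.
delay = smart_inter_check_delay(2500, 300.0)
offsets = first_check_offsets(2500, delay)

print(f"delay between initial checks: {delay:.3f} s")
print(f"last service first checked about {offsets[-1]:.1f} s after start")
```

Under those assumptions a one-minute check that lands near the end of the queue waits close to five minutes for its first run, even though it is rechecked every minute afterwards.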
The question I have is, if I use state retention, would that have any impact on nagios’ post-restart scheduling?
The real goal is to have service checks resume as quickly as possible after restart, almost as if the system had not been restarted. Understandably, that’s not going to happen.
The reason for this interest is that we save the performance data from many of our service checks, often checks that execute once per minute. But when nagios is restarted, some of those minutely checks don’t start getting checked again for 5 minutes. This amounts to missing performance data each time we restart nagios.
So, would it be foolish to change the inter_check_delay_method to a hard value? Would using state retention get service checks resuming any faster?
I suppose nobody here could answer that correctly, as we didn’t develop Nagios…
I’d say chances look good, but without a test I wouldn’t guarantee anything…
I don’t think it could get worse than what you have now… change the setup and try restarting it… at least you already know you could lose a couple of cycles of data.
I did some testing (just posting here for those who are curious). Apparently, whether or not state retention is enabled doesn’t matter in terms of when nagios will get the services scheduled when it restarts. It still takes several minutes for nagios to distribute the checks when it starts. So, using state retention is not going to keep my performance data from being interrupted.
I also experimented with changing the inter_check_delay_method, just to see how it worked. Turns out that won’t help me either. Even if I set the delay to .1 seconds between service checks, the first pass still takes the same amount of time to finish, since nagios actually has to churn through all 2500 service checks, and that takes some time to accomplish. So the delay is apparently not in the delay_method, but in the time it takes to execute so many service checks.
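To illustrate why shrinking the delay might not help, here is a back-of-the-envelope sketch. It treats the first pass as limited by whichever is slower: handing out the checks at the configured delay, or actually running them. The average runtime (3 s) and the number of checks running in parallel (25) are invented numbers, not measurements from my system, and the function name is hypothetical.

```python
import math


def first_pass_seconds(total_checks, inter_check_delay_s,
                       avg_runtime_s, max_parallel):
    """Rough lower bound on the time to complete one pass over all checks."""
    # Time needed just to hand out the checks at the configured delay.
    scheduling_floor = total_checks * inter_check_delay_s
    # Time needed to run them all, max_parallel at a time.
    execution_floor = math.ceil(total_checks / max_parallel) * avg_runtime_s
    return max(scheduling_floor, execution_floor)


# 2500 checks is from the post; runtime and parallelism are assumptions.
for delay in (0.12, 0.1, 0.01):
    total = first_pass_seconds(2500, delay, avg_runtime_s=3.0, max_parallel=25)
    print(f"delay {delay:>5} s -> first pass takes about {total:.0f} s")
```

With these made-up numbers the execution time dominates as soon as the delay drops below about 0.12 s, so lowering it further changes nothing.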
Looks like there’s no real way around this issue other than, perhaps, to postpone reloads to a certain time of the night.
FWIW, the reason we currently don’t use state retention is that we have not yet implemented tools within our custom interface to execute the nagios external commands when certain things change (notification enable/disable, and all the other device and service fields that are read from the saved state instead of the new config entries). So, I figure we’ll be saving state data soon, once we get those calls implemented.
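For those curious what such a call might look like, here is a minimal sketch of submitting an external command. It assumes the usual “[timestamp] COMMAND;arg1;arg2” line format written to the nagios command file. The helper names, the host “web01”, and the service “HTTP” are made up for illustration, and the real command file path is whatever command_file points to in your nagios.cfg.

```python
import time


def format_external_command(name, *args, now=None):
    """Build one external command line: "[epoch] NAME;arg1;arg2\\n"."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] " + ";".join((name,) + args) + "\n"


def submit_external_command(command_line, command_file):
    """Write the line to the nagios command file (a named pipe).

    command_file is an assumption: pass the path configured as
    command_file in nagios.cfg.
    """
    with open(command_file, "w") as pipe:
        pipe.write(command_line)


# Example: disable notifications for a hypothetical service.
line = format_external_command(
    "DISABLE_SVC_NOTIFICATIONS", "web01", "HTTP", now=1200000000
)
print(line, end="")
```

The idea would be for our config manager to emit the matching ENABLE_/DISABLE_ command whenever one of those fields changes, so the retained state and the new config stay in agreement.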