The huge Check Latency

jakkedup · January 19, 2006, 9:12pm

Nagios looks at the number of checks it has to make, the intervals you have set, and a lot of your other nagios.cfg settings, and calculates how it can spread the checks out.
For example, give it 60 checks and 1 minute to perform them, and it will schedule them approximately 1 second apart. Give it 2 minutes, and it will schedule them 2 seconds apart. But if each check takes 3 seconds, then even with only 60 checks (180 seconds of execution in total), it can’t get through them in 2 minutes. That is simply impossible no matter what kind of code you have, Nagios or anything else.
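The arithmetic above can be sketched in a few lines of Python. This is only a simplified illustration, not Nagios's actual scheduling code; the function names are made up for the example, and `serial_runtime` assumes the checks run strictly one after another.

```python
def check_spacing(num_checks: int, window_seconds: float) -> float:
    """Seconds between check start times when checks are spread evenly over the window."""
    return window_seconds / num_checks


def serial_runtime(num_checks: int, exec_seconds: float) -> float:
    """Total wall-clock time if the checks ran strictly one after another (assumption)."""
    return num_checks * exec_seconds


if __name__ == "__main__":
    print(check_spacing(60, 60))    # 60 checks in 1 minute  -> 1.0 second apart
    print(check_spacing(60, 120))   # 60 checks in 2 minutes -> 2.0 seconds apart
    print(serial_runtime(60, 3))    # 60 checks at 3 s each  -> 180 seconds in total
```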

Now, if you give it poor settings like service_reaper_frequency=2, which forces Nagios to STOP making checks, among other things, then it takes those into account as well. Since it’s evident that it will be impossible to make that many checks within 5 minutes with a check execution time of 3 seconds, you have no choice but to find the bad service checks. It won’t matter what program you use; it’s your checks that are the problem.

goody · January 20, 2006, 5:17am

But I thought service checks are parallelized/multi-process, and a top/ps would indicate this is the case. Even if a check takes 10 seconds to execute, I can have 20 of them running at once, can’t I? This is Unix, not DOS.

If I have 60 checks with a 1 minute interval, but it takes three seconds to complete a check, why not launch three checks every three seconds to get it all done in the one minute interval?
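A rough way to put numbers on this: the average number of checks that must be in flight at once is the total execution time divided by the interval. The sketch below is plain arithmetic under the assumption that every check takes the same time and that launches are spread evenly; `required_concurrency` is a hypothetical helper, not a Nagios setting.

```python
import math


def required_concurrency(num_checks: int, exec_seconds: float, interval_seconds: float) -> int:
    """Minimum number of checks that must run at the same time to finish
    num_checks within interval_seconds, assuming uniform execution time."""
    return math.ceil(num_checks * exec_seconds / interval_seconds)


if __name__ == "__main__":
    # 60 checks, 3 seconds each, 1 minute interval
    print(required_concurrency(60, 3, 60))   # 3 checks in parallel
    # Same checks run one at a time would need a 180 second interval
    print(required_concurrency(60, 3, 180))  # 1
```

With three checks in flight, launching a batch of three every three seconds gives 20 batches, which fits exactly into the 60 second interval.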

jakkedup · January 20, 2006, 4:14pm

Good point, but it would be a good idea if your calculations didn’t take that into account. That way, you have some cushion. I have no checks that take as long as your longest one does. That is the problem, and you need to address it. My guess is that, since you are using check_snmp, you are NOT using the -m option in your commands.

So, paste your check_snmp definition and also the service definition that uses check_snmp, and let’s see why it takes so long to run.

wsanders · January 24, 2006, 7:15pm

Hung service checks will skew latency big time. I’m monitoring about 725 services with an old dual-CPU PIII box, all active checks, and my latencies average less than a second with a max of 20 or so. When a host or two goes down and things start hanging, the average goes up to a few secs and the max rises as well. My Nagios uses about 10 to 12 minutes of CPU per day.

Make sure service_check_timeout is less than a minute, and you might want to tune it all the way down to 20 or 30 secs or so.
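To illustrate what a check timeout buys you, here is a minimal Python sketch of running a command with a hard time limit. It is not how Nagios implements service_check_timeout; `run_check` is a hypothetical helper, and the "hung check" is simulated with a Python one-liner that just sleeps.

```python
import subprocess
import sys


def run_check(argv, timeout_seconds):
    """Run a check command and return (status, output).

    If the command exceeds timeout_seconds, it is killed and the status
    is the string "TIMEOUT" instead of an exit code.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return ("TIMEOUT", "")
    return (result.returncode, result.stdout.strip())


if __name__ == "__main__":
    # Stand-in for a hung check: sleeps far longer than the timeout allows.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_check(hung, timeout_seconds=1))

    # Stand-in for a healthy check that answers immediately.
    healthy = [sys.executable, "-c", "print('OK')"]
    print(run_check(healthy, timeout_seconds=20))
```

Without the limit, the hung command would hold its slot for the full 30 seconds; with it, the slot is freed after 1 second.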

It seems like the Nagios scheduler is well written to conserve CPU power but it won’t launch checks quick enough to take advantage of hefty hardware.

Actually, the parameters supplied with the distro are conservative and won’t take advantage of hefty hardware.
For example, service_reaper_frequency defaults to 10 sec, which is about right for a smallish system. If I had a dual-Xeon, I’d tune this down to 2 or 3 secs to see what happens. The CPU is there to be used!

auto_reschedule_checks: A new feature of v2.0 to “spread out” checks - but I’d just let Nagios “pile on” the processes if I had a fast box.

max_service_check_spread: I tuned this way down to 10 minutes because I was seeing some big delays on startup. This is Nagios 2.0rc2. I seem to have really confused Nagios while I was building out my config by stopping and restarting it frequently. Now I make frequent config file changes, but only restart Nagios once or twice per day.