I have a huge check latency, I think because all checks are scheduled immediately in nagios, around 3600 checks, so it takes quite some time to ‘catch up’.
Also, is there a way to list the check execution time for all hosts/services? I can see it when I drill down to the service level in the web interface. Is there somewhere I can grep / awk to get this information? My max check execution time is 120 seconds, so I want to find that bugger!
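One possible approach, as a rough sketch only: on Nagios 2.x and later, the status file (status.dat) stores each host and service as a block of key=value lines, and the service blocks include check_execution_time and check_latency. Assuming that layout (the older 1.x status.log is laid out differently and would need different parsing), a few lines of Python can rank services by execution time. The function name slowest_checks and the sample data are made up for illustration.

```python
import re

def slowest_checks(status_text, top=10):
    """Return (seconds, host, service) tuples, slowest first."""
    results = []
    # Blocks look like "servicestatus {" ... "}" (assumed 2.x/3.x layout).
    for match in re.finditer(r"^\s*(\w+)\s*\{(.*?)^\s*\}", status_text, re.S | re.M):
        block_type, body = match.group(1), match.group(2)
        if block_type not in ("service", "servicestatus"):
            continue  # skip host, program, and info blocks
        fields = {}
        for line in body.splitlines():
            key, sep, value = line.strip().partition("=")
            if sep:
                fields[key] = value
        if "check_execution_time" in fields:
            results.append((float(fields["check_execution_time"]),
                            fields.get("host_name", "?"),
                            fields.get("service_description", "?")))
    return sorted(results, reverse=True)[:top]

# Made-up sample in the assumed status.dat layout.
SAMPLE = """\
hoststatus {
    host_name=web01
    check_execution_time=0.010
    }

servicestatus {
    host_name=web01
    service_description=HTTP
    check_execution_time=0.412
    check_latency=5.100
    }

servicestatus {
    host_name=db01
    service_description=Disk /var
    check_execution_time=120.000
    check_latency=9.500
    }
"""

for seconds, host, service in slowest_checks(SAMPLE):
    print(f"{seconds:8.3f}s  {host}  {service}")
```

To run it against a live system you would pass in the contents of your own status file instead of SAMPLE; where that file lives depends on the status_file setting in nagios.cfg.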
You’re also probably going to have a huge check latency with 3600 checks if your check intervals and retry intervals are all set to the same value. You might want to figure out which ones don’t need to be checked as often as others. In my case, I only check disk space on my hosts every hour (as I expect them to take a while to hit critical), whereas other, more critical services are checked once per minute.
Initially everything was scheduled to be checked every three minutes. I’ve changed all disk checks to half-hourly, moved all other checks from 3 minutes to 5, and dropped the max retries from 3 to 2.
I’ve got the latency down from 15 minutes to about 9.5 minutes. But it’s still coming down.
I think the latency would be a lot better if the smart service interleave factor was spreading out the checks, instead of all host checks being scheduled at once, with all checks for a single host right after one another in the scheduling queue.
inter_check_delay_method=n is wrong (n means no delay, so everything gets scheduled at once); it should be s (smart), correct.
3600 checks / 5 minutes = 720 checks/minute, which is 12 checks/sec, or about 0.08 seconds per check. Since an average check execution time will not be lower than 0.08 seconds, you will always have a huge latency no matter what you do. My average execution time is 0.4 seconds with a max latency of 5 seconds. Of course, I only have 519 active checks and 653 passive checks.
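The arithmetic above can be reproduced with a tiny Python sketch. It models checks as running strictly one after another, which is the simplification those figures rely on; the function name per_check_budget is just an illustrative choice.

```python
def per_check_budget(active_checks, interval_minutes):
    """Return (checks per second, seconds available per check)."""
    checks_per_second = active_checks / (interval_minutes * 60)
    return checks_per_second, 1 / checks_per_second

# 3600 active checks, each scheduled every 5 minutes.
rate, budget = per_check_budget(3600, 5)
print(f"{rate:.0f} checks/sec, {budget:.3f} s per check")
# prints "12 checks/sec, 0.083 s per check"
```

Plugging in your own check count and interval shows how much time each check gets before the queue starts falling behind.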
My suggestion is for you to set up as many passive checks as possible. Any checks you have left over that are active, you should then have another nagios box perform and submit as passive checks (distributed nagios setup, read the docs). Now your main nagios box is 100% passive and should be able to handle the load, since it is no longer making any checks at all. All it has to do is process the external command file.
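To give an idea of what ends up in the external command file, here is a minimal Python sketch. A passive service result is a single line using the PROCESS_SERVICE_CHECK_RESULT external command; the helper names passive_result_line and submit, the host and service names, and the command file path in the comment are assumptions for illustration, so check command_file in your own nagios.cfg.

```python
import time

def passive_result_line(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")

def submit(command_file, line):
    """Write the line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)

# Return codes follow the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
line = passive_result_line("web01", "HTTP", 0, "HTTP OK", now=1100000000)
print(line, end="")
# On the Nagios server itself you would then call something like:
# submit("/usr/local/nagios/var/rw/nagios.cmd", line)  # path is an assumption
```

On the main box the service needs passive checks enabled for the result to be accepted; in a distributed setup the remote box would normally send its results over the network with send_nsca rather than writing to the pipe directly.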
Is there a way to implement passive checks on Windows, without using snmp? I’ve seen the NC_Net V1_x package, is it any good? I only administer unix systems, so I can’t really play around on their boxes. Windows servers probably account for 75% of our systems.
Do you guys have ping as a standard check being executed every x minutes? If nagios pings every time a service has a problem state, should I get rid of a regularly scheduled ping?
I think you don’t want to get rid of ping… use it for graphing the network timings…
It gives at least a rough idea if something goes wrong somewhere on the network.
Took a look around the Nagios Exchange and found NSCA Win32Client, which is the Windows equivalent of the Unix utility send_nsca. This will let your Windows hosts submit passive check results to your Nagios server.
I would seriously consider jakkedup’s recommendation of a distributed Nagios setup. It should cut your check latency considerably.
[quote=“ben”]
Is there a way to implement passive checks on Windows, without using snmp? I’ve seen the NC_Net V1_x package, is it any good? I only administer unix systems, so I can’t really play around on their boxes. Windows servers probably account for 75% of our systems.
Do you guys have ping as a standard check being executed every x minutes? If nagios pings every time a service has a problem state, should I get rid of a regularly scheduled ping?[/quote]
I use nc_net, so try it out. There are other ways to run active checks on Windows and submit the results to nagios as passive checks.
Correct, get rid of the ping service check, since what you are really interested in is whether “apache is running” or whatever. At least until you get the rest of nagios up and running 100%; then add them back in. As luca said, ping is useful too. I have nagiostat graphing the ping results, which can reveal problems on your network, so almost all of my hosts have a ping service check.