Hi all. I’m adding some Windows servers to be monitored in my Nagios install. Some of these servers have CPUs that spike and stay high due to data processing at set times. Is there a way to tell Nagios to alert only if the CPU stays above a certain threshold for a certain period of time? Something like “if CPU1 is at 90% at 5:00pm, don’t alert, but if CPU1 is still at 90% at 5:15pm, send an alert”. I basically want Nagios to ignore the spikes, but warn if the CPU stays at the spike level for an extended period of time.
nagios.sourceforge.net/docs/3_0/ … ml#service
max_check_attempts
Hmmm. I think I may be able to rig this to do what I need. Is there any way to set an actual time frame though? Also, call me noobie, but how do I figure out exactly how much time is in between the host checks and services checks for each individual host/service?
host checks shouldn’t be done automatically if the services are up.
interval between service checks is check_interval.
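To work out the actual time between checks: interval directives are given in "time units", and the length of one unit comes from interval_length in nagios.cfg (60 seconds by default). A minimal sketch of the conversion follows; the function name is hypothetical and not part of Nagios.

```python
# Sketch: convert a Nagios interval directive (in "time units") to seconds.
# Assumption: interval_length is whatever nagios.cfg sets; 60 is the default.

def interval_to_seconds(units: float, interval_length: int = 60) -> float:
    """Return the wall-clock seconds represented by `units` time units."""
    return units * interval_length


if __name__ == "__main__":
    # check_interval 10 with the default interval_length
    print(interval_to_seconds(10))
```

So with the default settings, check_interval 10 means one check every 10 minutes.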
Well, I’ve gotten myself a bit confused. I finally have a set number though, so maybe someone can help. I basically want to tell Nagios to only send an alert if a server’s CPU remains above XX% for one hour. So in my service definition I added the following two lines to the CPU monitoring service:
max_check_attempts 6
check_interval 10
Those directives will cause Nagios to check every 10 minutes when in a non-OK state and then send an alert, correct? Or would I need retry_interval instead of check_interval?
yes, but have a look for yourself
nagios.sourceforge.net/docs/3_0/ … ml#service
In your original post you wanted to check the CPU level, but there are known times when the CPU will be higher.
There is a “check_period” directive in the service definition. If you want that control, you will need to configure the service twice, with different load levels for different times.
It increases the size of your configs, but if you want fine control over time-of-day levels, that’s how you go about it.
define timeperiod{
timeperiod_name highCPULoadTime
alias High CPU Load Time
sunday 17:00-17:15
monday 17:00-17:15
tuesday 17:00-17:15
wednesday 17:00-17:15
thursday 17:00-17:15
friday 17:00-17:15
saturday 17:00-17:15
}
define timeperiod{
timeperiod_name lowCPULoadTime
alias Low CPU Load Time
sunday 00:00-17:00,17:15-24:00
monday 00:00-17:00,17:15-24:00
tuesday 00:00-17:00,17:15-24:00
wednesday 00:00-17:00,17:15-24:00
thursday 00:00-17:00,17:15-24:00
friday 00:00-17:00,17:15-24:00
saturday 00:00-17:00,17:15-24:00
}
define service{
host_name myServer
service_description CPULoad-High ; must be unique per host
use serviceTemplate ; I'm big on templates!
check_period highCPULoadTime
max_check_attempts 6
check_interval 10
...
}
define service{
host_name myServer
service_description CPULoad-Low ; must be unique per host
use serviceTemplate ; I'm big on templates!
check_period lowCPULoadTime
max_check_attempts 3
check_interval 10
...
}
Something like that…
GR
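When a service is split across two timeperiods like this, it is easy to leave a gap or an overlap between them. The following is a rough sketch of a sanity check for that, assuming simple HH:MM-HH:MM ranges like the ones above. The helper names are hypothetical, and this is not a full parser for Nagios timeperiod syntax.

```python
# Sketch: check that several timeperiod range strings tile one full day.
# Assumption: each spec is a comma-separated list of HH:MM-HH:MM ranges.

def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes since midnight ('24:00' gives 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def parse_ranges(spec: str) -> list:
    """Parse '00:00-17:00,17:15-24:00' into [(start_min, end_min), ...]."""
    ranges = []
    for part in spec.split(","):
        start, end = part.strip().split("-")
        ranges.append((to_minutes(start), to_minutes(end)))
    return ranges


def covers_day_exactly_once(*specs: str) -> bool:
    """True if all ranges together cover 00:00-24:00 with no gap or overlap."""
    ranges = sorted(r for spec in specs for r in parse_ranges(spec))
    cursor = 0  # minute of the day covered so far
    for start, end in ranges:
        if start != cursor or end <= start:
            return False  # gap, overlap, or empty range
        cursor = end
    return cursor == 24 * 60


if __name__ == "__main__":
    high = "17:00-17:15"
    low = "00:00-17:00,17:15-24:00"
    print(covers_day_exactly_once(high, low))
```

For the two periods defined above this prints True. Widening one window without adjusting the other would make it print False.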
@luca–Ok so I double checked the documentation and it seems to me that I should add in the retry_interval. As I understand it, if I set the definition like so…
max_check_attempts 6
retry_interval 10
check_interval 3
It will normally check the CPU every 3 minutes, except when it detects a non-OK state, in which case Nagios will check the first time, then 6 subsequent times, once every 10 minutes. Am I correct on that? The wording in the documentation confuses me a bit, as I don’t see much difference between check_interval and retry_interval. I’d also appreciate it if you could enlighten me as to the difference.
@groth–Thanks for the in depth answer. That idea is actually really clever, I may use it later on. Right now what I’m working on is for a client and I’m just looking for a simple solution.
What you did is correct, except for the count: the first non-OK result counts as attempt 1, so max_check_attempts 6 gives 5 retries, not 6. Also, check_interval is usually higher than retry_interval, since you mostly want infrequent checks and then a couple of rapid rechecks when something isn’t OK before sending the notification (mainly to be sure it’s not a single timeout).
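To put numbers on the one-hour target from earlier in the thread, here is a small sketch of the arithmetic. It assumes the first non-OK result counts as attempt 1 and that retries are spaced exactly retry_interval apart. It ignores check execution time, scheduling jitter, and however long the problem existed before the first regular check caught it. The function name is hypothetical.

```python
# Sketch: time from the first non-OK result to the hard state (notification).
# Assumption: interval_length is in seconds per time unit (default 60).

def minutes_until_hard_state(max_check_attempts: int,
                             retry_interval: float,
                             interval_length: int = 60) -> float:
    """Minutes between the first non-OK check and the final retry.

    The first non-OK result is attempt 1, so only
    max_check_attempts - 1 retries follow it.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    retries = max_check_attempts - 1
    return retries * retry_interval * interval_length / 60


if __name__ == "__main__":
    print(minutes_until_hard_state(6, 10))  # 50.0
    print(minutes_until_hard_state(7, 10))  # 60.0
```

Under these assumptions, max_check_attempts 6 with retry_interval 10 notifies after about 50 minutes of sustained high CPU. A value of 7 would be closer to the full hour.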
Ah, so if a regular check (check_interval) returns non-OK, the retry_interval directive kicks in and does rapid checks until max_check_attempts is exhausted, at which point Nagios sends a notification, unless a check returns OK in the meantime?
yes.
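The behaviour confirmed here can be sketched as a toy simulation. It is a simplified model and not Nagios code: it only tracks problem notifications, and it ignores recovery notifications, notification_interval re-notifications, and host state. The labels "OK", "SOFT" and "HARD" and the function name are illustrative choices.

```python
# Sketch: toy model of soft/hard problem states for a single service.
# Assumption: any result other than "OK" counts as non-OK.
from typing import Iterable, List, Tuple


def simulate(results: Iterable[str],
             max_check_attempts: int) -> List[Tuple[str, str, str, bool]]:
    """Return (result, phase, next_interval, notify) for each check result."""
    timeline = []
    attempt = 0           # consecutive non-OK results in the current problem
    hard_problem = False  # True once the problem has reached a hard state
    for result in results:
        notify = False
        if result == "OK":
            # Any OK result resets the counter; back to regular checks.
            attempt = 0
            hard_problem = False
            phase, next_interval = "OK", "check_interval"
        elif hard_problem:
            # Already notified; keep checking at the regular interval.
            phase, next_interval = "HARD", "check_interval"
        else:
            attempt += 1
            if attempt >= max_check_attempts:
                hard_problem = True
                notify = True
                phase, next_interval = "HARD", "check_interval"
            else:
                phase, next_interval = "SOFT", "retry_interval"
        timeline.append((result, phase, next_interval, notify))
    return timeline


if __name__ == "__main__":
    spike = ["CRITICAL"] * 5 + ["OK"]  # clears before attempt 6
    sustained = ["CRITICAL"] * 6       # still high on attempt 6
    print(any(step[3] for step in simulate(spike, 6)))      # False
    print(any(step[3] for step in simulate(sustained, 6)))  # True
```

In this model a spike that clears before the last attempt never produces a notification, which is the "ignore the spikes" behaviour asked about at the top of the thread.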