Check intervals for CPU monitoring


#1

Hi all. I’m adding some Windows servers to be monitored in my Nagios install. Some of these servers have CPUs that spike and remain high due to data processing at set times. Is there a way to tell Nagios to only alert if the CPU stays above a certain threshold for a certain period of time? Kind of like “if CPU1 = 90% at 5:00pm, don’t alert, but if CPU1 is still at 90% at 5:15pm, send an alert”. I basically want Nagios to ignore the spikes, but warn if the CPU remains in the spike for an extended period of time.


#2

nagios.sourceforge.net/docs/3_0/ … ml#service
max_check_attempts


#3

Hmmm. I think I may be able to rig this to do what I need. Is there any way to set an actual time frame, though? Also, call me a noob, but how do I figure out exactly how much time there is between the host checks and service checks for each individual host/service?


#4

Host checks shouldn’t be done automatically if the services are up.
The interval between service checks is check_interval.


#5

Well, I’ve gotten myself a bit confused. I finally have a set number, though, so maybe someone can help. I basically want to tell Nagios to only send an alert if a server CPU remains above XX% for one hour. So in my service definition I added the following two lines to the CPU monitoring service:

max_check_attempts 6
check_interval 10

Those directives will cause Nagios to check every 10 minutes when in a non-OK state and then send an alert, correct? Or would I need retry_interval instead of check_interval?


#6

Yes, but have a look for yourself :)
nagios.sourceforge.net/docs/3_0/ … ml#service


#7

In your original post you wanted to check the CPU level, but there are known times when the CPU is expected to be higher.
There is a “check_period” directive in the service definition. If you want that control, you will need to configure the service twice, with different load levels for different times.

It increases the size of your configs, but if you want fine control of time-of-day levels, that’s how you go about it.

define timeperiod{
        timeperiod_name highCPULoadTime
        alias           High CPU Load Time
        sunday          17:00-17:15
        monday          17:00-17:15
        tuesday         17:00-17:15
        wednesday       17:00-17:15
        thursday        17:00-17:15
        friday          17:00-17:15
        saturday        17:00-17:15
        }
define timeperiod{
        timeperiod_name lowCPULoadTime
        alias           Low CPU Load Time
        sunday          00:00-17:00,17:15-24:00
        monday          00:00-17:00,17:15-24:00
        tuesday         00:00-17:00,17:15-24:00
        wednesday       00:00-17:00,17:15-24:00
        thursday        00:00-17:00,17:15-24:00
        friday          00:00-17:00,17:15-24:00
        saturday        00:00-17:00,17:15-24:00
        }
define service{
        host_name               myServer
        service_description     CPULoad-HighPeriod      ; host_name + service_description must be unique
        use                     serviceTemplate         ; I'm big on templates !
        check_period            highCPULoadTime
        max_check_attempts      6
        check_interval          10
...
}
define service{
        host_name               myServer
        service_description     CPULoad-LowPeriod       ; host_name + service_description must be unique
        use                     serviceTemplate         ; I'm big on templates !
        check_period            lowCPULoadTime
        max_check_attempts      3
        check_interval          10
...
}

Something like that…

GR
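
The “different load levels for different times” part could look roughly like the sketch below. It is only an illustration under assumptions that are not in the post above: it assumes the stock check_nt command from the Nagios sample configs (with NSClient++ listening on the Windows host), the thresholds are made-up numbers, and the two service_description values are hypothetical names chosen so the two services do not collide on the same host.

```cfg
# Sketch only: check_nt, the thresholds and the service names are assumptions.
define service{
        use                     serviceTemplate
        host_name               myServer
        service_description     CPULoad-HighPeriod
        check_period            highCPULoadTime
        check_command           check_nt!CPULOAD!-l 5,95,99     ; looser limits while the batch job runs
        }

define service{
        use                     serviceTemplate
        host_name               myServer
        service_description     CPULoad-LowPeriod
        check_period            lowCPULoadTime
        check_command           check_nt!CPULOAD!-l 5,80,90     ; tighter limits for the rest of the day
        }
```

In the "-l 5,80,90" argument the first number is the averaging window in minutes, followed by the warning and critical percentages.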


#8

@luca–OK, so I double-checked the documentation and it seems to me that I should add retry_interval. As I understand it, if I set the definition like so…

max_check_attempts 6
retry_interval 10
check_interval 3

It will normally check the CPU every 3 minutes, except when it detects a non-OK state, in which case Nagios will check the first time, then 6 subsequent times, once every 10 minutes. Am I correct on that? The wording in the documentation confuses me a bit, as I don’t see much difference between check_interval and retry_interval. I’d also appreciate it if you could enlighten me as to the difference.

@groth–Thanks for the in-depth answer. That idea is actually really clever; I may use it later on. Right now what I’m working on is for a client and I’m just looking for a simple solution.


#9

What you did is correct, but usually check_interval is higher than retry_interval: mostly you want fewer routine checks, followed by a couple of rapid checks when something isn’t OK before the notification is sent (mainly to be sure it’s not a single timeout).
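
For comparison, the more common “slow routine checks, fast retries” pattern might look like the sketch below. The numbers are illustrative, the template is assumed to supply check_command and the other required directives, and the timeline in the comments assumes the default interval_length of 60 seconds (so the values are minutes). One detail worth checking against the docs: as far as I know, max_check_attempts counts the first non-OK result as attempt 1, so the number of re-checks is max_check_attempts minus one.

```cfg
# Sketch only: values are illustrative, serviceTemplate is assumed to provide the rest.
define service{
        use                     serviceTemplate
        host_name               myServer
        service_description     CPULoad
        check_interval          10      ; routine checks every 10 minutes while OK
        retry_interval          2       ; re-checks every 2 minutes while in a SOFT non-OK state
        max_check_attempts      3       ; attempt 1 = first non-OK result, then 2 re-checks
        }

# Timeline after a routine check returns non-OK at t = 0:
#   t = 0 min   attempt 1/3  SOFT
#   t = 2 min   attempt 2/3  SOFT
#   t = 4 min   attempt 3/3  HARD, notification goes out
# An OK result at any point returns the service to OK and resets the attempt counter.
```

By the same counting, max_check_attempts 6 with retry_interval 10 would reach the HARD state about 50 minutes after the first non-OK result, not 60.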


#10

Ah, so if a normal check (check_interval) returns non-OK, then the retry_interval directive kicks in and does rapid checks until it exhausts max_check_attempts, at which point it sends a notification unless a check returns OK?


#11

yes.
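
To pull the thread together, here is a minimal sketch of a service that should only notify after roughly an hour of sustained high CPU. It rests on assumptions not stated in the thread: the check_nt command from the Nagios sample configs (NSClient++ on the Windows host), thresholds of 80/90 standing in for the “XX%” above, the default interval_length of 60 seconds, and a template that supplies contacts and notification settings. Treat it as a starting point to verify with a config check (nagios -v) rather than a definitive setup.

```cfg
# Sketch only: check_nt and the 80/90 thresholds are assumptions, not from the thread.
define service{
        use                     serviceTemplate
        host_name               myServer
        service_description     CPULoad
        check_command           check_nt!CPULOAD!-l 5,80,90     ; 5-minute average, warn 80%, crit 90%
        check_interval          3       ; routine check every 3 minutes while OK
        retry_interval          10      ; re-check every 10 minutes once a check is non-OK
        max_check_attempts      7       ; first non-OK result + 6 re-checks = 60 minutes
        }

# Time to notification, counted from the first non-OK result:
#   (max_check_attempts - 1) * retry_interval = (7 - 1) * 10 = 60 minutes
# Add up to check_interval (3 minutes) for the spike to be noticed in the first place.
```

Note that this only samples the CPU every 10 minutes during the retry phase, so a single OK result in between resets the count and a load that dips briefly at the wrong moment will not alert.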