OK to WARNING to CRITICAL

I may be missing something in a config, but for the life of me I can’t seem to find it. Currently, when a service doesn’t respond properly after 1 try, it immediately goes CRITICAL. I would like the service to go to WARNING for failures 1 and 2, and switch to CRITICAL on the third failure. How can I accomplish this?

Depending on how you have a service check configured and which service check you are making, it may never go to warning (it’s plugin-dependent). I could write a plugin that never goes to critical but only to a warning state. Does that make sense? It all depends on the plugin.
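To make the "plugin that never goes critical" idea concrete, here is a minimal sketch of a wrapper that runs another check and downgrades CRITICAL to WARNING. It relies only on the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The names cap_at_warning and run_capped are hypothetical, and the demo command is a stand-in for a real check, not something from this thread.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def cap_at_warning(exit_code: int) -> int:
    """Downgrade CRITICAL to WARNING; pass every other code through unchanged."""
    return WARNING if exit_code == CRITICAL else exit_code


def run_capped(command):
    """Run a check command and return (capped exit code, plugin output text)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return cap_at_warning(completed.returncode), completed.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a real check: prints a status line and exits CRITICAL.
    demo = [sys.executable, "-c", "import sys; print('HTTP CRITICAL'); sys.exit(2)"]
    code, text = run_capped(demo)
    print(code, text)
```

Note that a wrapper like this only changes the colour Nagios shows. It never produces CRITICAL at all, so on its own it does not give the "warning first, critical on the third failure" behaviour asked about.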

But assume we are using a check like check_snmp, which has -w (warning) and -c (critical) thresholds. When the service check is made, let’s say it returns critical, OK? Nagios will show it as critical, but depending on how the retries are set up, it will not send out notifications yet, since it’s still in the SOFT state. My checks are set up to retry FIVE times, so on the 5th try, if the check STILL fails, then and only then will it send out a notification. It’s important to set the retry level high enough, because you don’t want to be emailed about every single failure before the check has been tried 5 times or so.
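The SOFT/HARD behaviour described above can be sketched as a small simulation. This is a simplified model, not Nagios code: it assumes the first failed check counts as attempt 1, that the problem notification goes out only when the attempt count reaches max_check_attempts, and it ignores soft recoveries and repeat notifications. The function name simulate is made up for illustration.

```python
def simulate(results, max_check_attempts):
    """Return one (state, state_type, attempt, notify) tuple per check result.

    Simplified model: soft recoveries and re-notifications are not modelled.
    """
    timeline = []
    failures = 0  # consecutive non-OK results so far
    for result in results:
        if result == "OK":
            failures = 0
            timeline.append(("OK", "HARD", 1, False))
            continue
        failures += 1
        attempt = min(failures, max_check_attempts)
        # The state stays SOFT until the configured number of attempts has failed.
        state_type = "HARD" if failures >= max_check_attempts else "SOFT"
        # Notify only on the check that first makes the problem HARD.
        notify = failures == max_check_attempts
        timeline.append((result, state_type, attempt, notify))
    return timeline


if __name__ == "__main__":
    for row in simulate(["CRITICAL"] * 5, max_check_attempts=5):
        print(row)
```

With max_check_attempts set to 5, the first four failures show as CRITICAL but SOFT, and only the fifth is HARD and triggers a notification. With max_check_attempts set to 1, the very first failure is already HARD.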

It makes no sense to take a plugin like check_snmp, which lets you check disc space on some other machine and set warning and critical levels, and change it to report only a “warning” on the first 3 checks and go critical on the 4th. Why would you want it to go critical when only the warning level of disc space has been reached? I don’t get it.

I realize fully that what you wish for is nagios to display the yellow warning, since the check only failed once or twice, and then on the 4th check to go red-critical. But you might have failed to notice that the red critical message was still in the “SOFT” state, and has not yet sent out any notifications.

Think about it.

I understand that I can give different thresholds on snmp checks. That would be the best for that situation. However, what I’ve seen done in a previous implementation was specifically on http, ssh and termserv checks. In my specific situation, I want nagios to go to the yellow/warning state when the webserver doesn’t respond the first time. Then, on the third time, I want it to go red and notify me (in services I have it set to notify me when it goes critical and when it recovers). However, right now it goes critical when it first goes down. Having a warning state gives me the chance to see if it’s a temporary flap, and not blast the alarm until it gets to its third try. Here’s my services entry for this check:

define service{
    use                     generic-service
    host_name               www
    service_description     HTTP
    is_volatile             0
    check_period            24x7
    max_check_attempts      3
    normal_check_interval   1
    retry_check_interval    1
    contact_groups          admins
    notification_interval   240
    notification_period     24x7
    notification_options    c,r
    check_command           check_http
}

I’ve seen it done, and it was done system-wide (on over 800 servers) for tcp checks, not snmp checks. If what I’m asking is not feasible, that’s fine, I’ll figure out something. Thanks for the reply though :slight_smile:

you’ve seen it on nagios? ask the webmaster where you’ve seen this :slight_smile: (and please report how)

AFAIK it’s not possible without modifying the nagios source (specifically the reporting CGIs)

Luca

I really wish I was still in contact with him. I’ll see if I can find his email and find out.

The other issue I’m having is that I get notified on attempt 1/3, instead of on 3/3 when it goes HARD critical. Any ideas?

what did you set as max_check_attempts ?

Luca

max_check_attempts is set to 3.

uh… strange… i get a notification only after the third attempt…
don’t know what this could be…

Luca

Ahh, there is your problem right here in your config file.
" max_check_attempts 3
normal_check_interval 1
retry_check_interval 1 "

Your normal check interval is 1 minute. If the service check fails, then it will retry the check 3 more times. Your retry check interval is set to 1 minute. So after that one minute it will retry the check again and then again once more and then at only that time, it will send out a notification and give a hard failure.

BUT, and the big BUT is that your normal check interval is ALSO set to ONE MINUTE, which ends up stepping on the retry.

I think it’s overkill to check every one minute myself, but if you must, you must. But then you would have to live without the service check EVER being able to be rechecked, since the smallest nagios interval is 1.

In other words, you can’t define it like this:
max_check_attempts 3
normal_check_interval 1
retry_check_interval .30 (not possible)

Suggestion is this.
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1

The only problem with that is I need to know within the minute if the web service is down. This is a very important site, and needs to be up as it provides a public service. I guess I can set the normal to 1, and the retry to 2…since I’ll know if it’s back because I fixed it :wink:

If you set normal to 1 and retry to 2, then the retry will NEVER occur.

If you need per minute status, then set normal to 1 and max check attempts to 1. At that setting, there will be no retries attempted. As you have it now, nagios does attempt to retry, but ends up getting stepped on by the NORMAL check interval.

In your original problem statement, you wanted it to go to warning first, then to critical, and now you state you want it checked every minute and, if it fails, to notify you immediately. There is nothing wrong with that, but you will have to live with the reality that since you have a normal interval of 1 minute, there will be NO possibility of a retry because there is nothing less than 1.

except 0 of course, which is essentially what you have. But I still suggest that you change max check attempts to 1, since that is really how you want it.

You could change your nagios.cfg file setting to:

# INTERVAL LENGTH
# This is the seconds per unit interval as used in the
# host/contact/service configuration files. Setting this to 60 means
# that each interval is one minute long (60 seconds). Other settings
# have not been tested much, so your mileage is likely to vary…

interval_length=10

That would make each interval 10 seconds long.
Then change your services.cfg to
normal_check_interval 6 (which would be 60 seconds)
retry_check_interval 1 (which would be 10 seconds)
and finally
max_check_attempts 5 (which would make for 50 seconds.)

That way, the http checks would be performed every minute just like you wanted.
If it fails, it will retry the check for a total of 5 times. It would retry the check every 10 seconds, 5 times.

That would do it.

Don’t forget to adjust any other settings that this interval length affects, such as the notification_interval in hosts.cfg, which is 120 by default.
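The arithmetic behind this suggestion can be checked with a few lines. This is only a sketch of the unit conversion (units times interval_length gives seconds). The time-to-HARD figure assumes the first failed check counts as attempt 1, so 5 attempts means 4 retries after the first failure, which comes out slightly under the rough 50 seconds quoted above. The helper name to_seconds is made up for illustration.

```python
def to_seconds(units, interval_length=60):
    """Convert a Nagios interval given in units into seconds."""
    return units * interval_length


interval_length = 10          # seconds per unit, as set in nagios.cfg
max_check_attempts = 5

normal_check = to_seconds(6, interval_length)    # 60 seconds
retry_check = to_seconds(1, interval_length)     # 10 seconds

# Assumption: attempt 1 is the first failure, so only 4 retries follow it.
retries = max_check_attempts - 1
time_to_hard = retries * retry_check             # 40 seconds

# A notification_interval left at 120 units now means 20 minutes, not 2 hours.
notification = to_seconds(120, interval_length)  # 1200 seconds

print(normal_check, retry_check, time_to_hard, notification)
```

The last value is the reason for the warning about other files: every setting expressed in interval units shrinks by the same factor once interval_length drops from 60 to 10.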

Perfect! That may be the ticket…thanks!

Hi,
I got it working the way you describe.
BUT, I have another problem:
I set some threshold values for check_nt MEMUSE. For example, the warning threshold is 80 and the critical threshold is 85.
But after 3 polls my warning alarm turns to critical.
What I want:
If the value is between 80 and 85, I always get a warning. If the value reaches 85, I always get a critical.
How can I do that?
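The behaviour asked for here can be written down as a small function, which may help when comparing it against what the plugin actually returns on each poll. This is only a sketch of the desired mapping, using the 80 and 85 thresholds from the post. It is not how check_nt is implemented and it does not explain the warning turning critical after 3 polls. The name classify is hypothetical.

```python
def classify(value, warn=80, crit=85):
    """Map a memory-usage percentage to the state the poster wants to see."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for sample in (70, 80, 84, 85, 95):
        print(sample, classify(sample))
```

In this mapping the state depends only on the current value, never on how many polls have already failed.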