I think you are going about this the wrong way completely.
First, you set interval_length in nagios.cfg to 180 instead of the default 60, which makes for a headache when you talk with other nagios users, since they assume an interval is 1 minute.
The host checks should NEVER be done, since you will most likely have a service check of check_ping. Your problem is entirely the plugin itself that you have for your "real time check". I don't know what service check you are getting false positives out of, but that is the issue to address.
One thing I noticed is that check_ping, or any service check for that matter, can show a device as down when, at that moment, it really is down as far as the check can tell, since it's not responding to a "ping", for example.
So, in that case, the best thing to do is to make darn sure it won't "ping" for a very long time before I send out an alert.
So, for the check_ping command, you may want to play with the -p option, which sets the number of packets sent. If I set -p 1, then surely I'll have many, many check failures. But if I set -p 10, then it will send 10 packets, and if it fails after that many, the device MUST be down.
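As a rough sketch, the packet count can be set in the check_ping command definition itself. The file name, the $USER1$ plugin path macro and the $ARG1$/$ARG2$ threshold arguments follow the stock sample config and are assumptions here, not something taken from your setup:

```
# checkcommands.cfg (assumed file name)
# -p 10 makes the plugin send 10 ICMP echo packets per check
define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 10
        }
```

A service would then call it with something like check_ping!100.0,20%!500.0,60% (warning at 100 ms round trip or 20% loss, critical at 500 ms or 60% loss). Those thresholds are only illustrative, so tune them to your own links.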
First, set interval_length=60 in nagios.cfg like it was by default, and put back any other item you changed, like the service_interleave_factor (leave it at s).
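In nagios.cfg terms, that would be roughly the following. It is only a fragment showing the two settings named above, and the rest of your file stays as it is:

```
# nagios.cfg
# one "interval" = 60 seconds, so intervals in the object configs read as minutes
interval_length=60
# s = "smart", nagios works out the interleave factor on its own
service_interleave_factor=s
```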
We want to make darn sure it's down, so in services.cfg, for the service check that is a problem, we set max_check_attempts to 5 and retry_check_interval to 1.
That way, if the check command fails, nagios will retry it every 1 minute, for a total of 5 retries all 1 minute apart. If it still fails, then the service is definitely down. Or do you have devices that actually won't respond to a service check for longer than a 5 minute period? If so, then set max_check_attempts 10. That would be a total of 10 checks, all 1 minute apart, for a span of 10 minutes. Still fails after 10 retries? Then whatever plugin you are using is total crap, or you do, in fact, have a serious communication problem with your equipment.
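Put together, a service definition along these lines might look roughly like the sketch below. The host name, the generic-service template, the check_ping thresholds and the normal_check_interval of 5 are all assumptions for illustration. Only max_check_attempts and the 1 minute retry come from the description above. The directive names are the Nagios 2.x ones; Nagios 3 and later also accept check_interval and retry_interval.

```
# services.cfg
# host_name and the template name are hypothetical placeholders
define service{
        use                     generic-service
        host_name               my-problem-host
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        max_check_attempts      5
        normal_check_interval   5
        retry_check_interval    1
        }
```

With interval_length=60, the 1 here means retries are 1 minute apart, and nothing is sent out until the check has failed max_check_attempts times in a row. Bump max_check_attempts to 10 if your devices really do go quiet for longer than 5 minutes.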
Look at it this way. I have the same problem with 2 switch stacks in one of our hub closets. Both stacks, at various times, simply stop responding to a ping. They will do this for a span of 2 minutes or so at times, but since I have max_check_attempts 5, alerts are never sent, since the check_command works after the 2nd or 3rd retry (or 2 or 3 minutes later). But since nagios has shown me these failures, I am able to identify that there is a problem with this equipment. It's not the nagios setup, and at this time, I don't know if it's the router or the switches, or the fiber or what.

Same in your situation. Nagios is telling you that you have a problem. You really need to figure out why your service check is timing out and showing the device is down. This tells me that your customers may also be denied access. In my situation with the switch stacks, there is no interruption for devices plugged into the switch. My only problem is getting the switch stack management IP address to respond to a ping. So in my case, customers are NOT affected. In your case, they might be, and you need to fix that, not fix nagios.
[ Edited Thu Jan 05 2006, 08:10PM ]