Timeout Notifications

Right, I had an issue last week where the network between the Nagios machine and one of the monitored sites went down. Nagios correctly notified me only that the site's router was unreachable/down (not the hosts behind it). But it also sent a pile of plugin timeout notifications. These are spurious and should not be sent. If the device's parent is down, of course the plugins will time out, so no notifications whatsoever should be sent, but they are.

I’ve checked the docs, my configs, nothing seems wrong. Any ideas?

Where does it say that if a host has a service check that gives an “unknown” error, notifications will not be sent? Unknown is a valid state and I may want to know about unknowns.
If you have a router down, and get notified as you did, you can also be notified about any devices that may ALSO be affected. Those devices may be in the state “unknown” or “unreachable”. It’s up to YOU to decide if you actually want those types of messages. I don’t, so I have my contact info set to:
service_notification_options w,c,r
host_notification_options d,r

Well, they are technically service critical errors although they actually say plugin timeout. By saying they are spurious, I’m trying to say that I don’t want them to be sent. They are basically useless, as all of those hosts/services are already known to be having trouble, because their parent is down.

All of my hosts’ notification_options are d,r and my services’ are c,r.

I’m basically wondering why I’m getting plugin timeout alerts (service critical alerts) when the parent host is “dead”. The Nagios docs state:

[blockquote]
Whenever a service check results in a non-OK status level, Nagios will attempt to check and see if the host that the service is running on is “alive”. Typically this is done by pinging the host and seeing if any response is received. If the host check command returns a non-OK state, Nagios assumes that there is a problem with the host. In this situation Nagios will “silence” all potential alerts for services running on the host and just notify the appropriate contacts that the host is down or unreachable.
[/blockquote]

That’s exactly the scenario I expected, i.e.:

1.) The service is non-OK
2.) Check the host
3.) Host non-OK
4.) Suppress all service-related alerts.
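
The four steps above can be sketched roughly like this. It's only a simplified model of what the quoted docs describe, not Nagios's actual code: the names `ServiceState`, `HostState`, `service_alert_allowed` and the `check_host` callable are all made up for illustration.

```python
from enum import Enum


class ServiceState(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


class HostState(Enum):
    UP = 0
    DOWN = 1
    UNREACHABLE = 2


def service_alert_allowed(service_state, check_host):
    """Return True if a service alert may go out, following the four steps.

    check_host is a hypothetical callable standing in for the host check
    (for example a ping); it is only called when the service is non-OK.
    """
    if service_state is ServiceState.OK:
        return False  # step 1 not met: nothing to alert on
    host_state = check_host()  # step 2: check the host
    if host_state is not HostState.UP:
        return False  # steps 3 and 4: host non-OK, suppress service alerts
    return True  # host is up, so the service problem stands on its own


# A timed-out plugin (reported as CRITICAL) on a host that is unreachable:
print(service_alert_allowed(ServiceState.CRITICAL, lambda: HostState.UNREACHABLE))
```

Under this model the plugin timeout alerts should have been suppressed, which is why the notifications I received look wrong.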

This didn’t happen, so either I missed something in the configs, or something else is going on. :shock:

Are you getting host and/or service emails?
For the emails you are getting, double check the settings for that service/host and make sure you have NOT included the “u” in notification_options. Are these notifications being sent due to an event handler script, as is the case with evert’s most recent post?

You are correct, you should not be getting these if your settings are truly correct. If they are, kill nagios, make sure all of its processes are dead (ps -ef|grep nagios), and then test again.

1.) The alerts that I think are incorrect, are service alerts. Not host alerts. I received no host alerts for them (expected behaviour)
2.) None of the service definitions have ‘u’ as a notification option. Only ‘c’ and ‘r’.
3.) The settings have been in place for a while now, but this is the first outage of this specific kind we’ve experienced, so I’d never seen this behaviour.
4.) event_handler_enabled is set to ‘1’ for both hosts and services, but there’s no event_handler defined, either at the host/service level or globally. I’m not entirely familiar with event_handler, so this might be it?

Do contact definitions override host/service definitions? I just noticed all my contacts have ‘u’ specified for service and host notification, and suspect that’s what caused it (although, why no host notifications in that case?)

Final problem is of course that if I make changes that I think will fix it, testing is a bit of a pain !grr

From the observed behaviour of Nagios it seems that contact definitions do NOT override host and service definitions.

All my contacts have:

service_notification_options w,u,c,r
host_notification_options d,u,r

Yet all my hosts/services only alert on critical and recovery, as that’s how they are defined.
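
One way to read what I'm seeing: a state only produces a notification if it passes both the service's options and the contact's options, so neither definition overrides the other. A minimal sketch of that reading follows; `parse_options` and `notification_sent` are hypothetical names for illustration, and this is a model of the observed behaviour rather than how Nagios is actually implemented.

```python
def parse_options(value):
    """Split an options string such as "w,u,c,r" into a set of letters."""
    return {opt.strip() for opt in value.split(",") if opt.strip()}


def notification_sent(state, service_options, contact_options):
    """Assume a state letter must pass both filters before a contact is notified."""
    return (
        state in parse_options(service_options)
        and state in parse_options(contact_options)
    )


# Service defined with c,r; contact defined with w,u,c,r.
print(notification_sent("u", "c,r", "w,u,c,r"))  # False: the service filter blocks it
print(notification_sent("c", "c,r", "w,u,c,r"))  # True: both filters allow critical
```

That would explain why the ‘u’ in my contacts never produced an unknown alert, but not why the service critical alerts went out while the parent was down.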

Now I’m even more confused :?

To test it, create a dummy host/service with a valid parent. Of course, your dummy is going to fail, since its IP is not even valid, so you should get a timeout.
Disable event handlers if, in fact, you aren’t using them. That saves nagios the trouble of attempting to use one and then finding out you haven’t defined one.