Hello,
In order to monitor the status of a link (up or down), we use passive service checks that receive traps to generate the appropriate alert. However, we have found that only one notification is raised when a link goes down, so our escalations do not work.
We don’t know whether this is due to a bug or our own misconfiguration, but any input you could offer would be greatly appreciated. We are currently running Nagios v.2.9 but we also observed this issue under v.2.5 & 2.6. Our escalation configuration is noted below for your information:
Hi,
Apologies for the late response, but I have been on holiday.
Our service definitions are configured as follows:
obsess_over_service 1
check_freshness 0
notifications_enabled 1
check_period 24x7
normal_check_interval 2
retry_check_interval 1
notification_interval 20
notification_period 24x7 <-- 24x7 is defined within timeperiods.cfg
notification_options w,u,c,r
register 0
To work around this issue, we have had to develop a script that manually resubmits a trap to Nagios if a recovery has not been detected. This is far from ideal, as we often find that recovery notifications are clobbered by our own resubmissions.
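For anyone wanting to try the same resubmission approach, here is a rough sketch of how a passive result can be written to the Nagios external command file using the PROCESS_SERVICE_CHECK_RESULT command. It is not our production script: the function names are made up for illustration, and the command file path is an assumption that depends on your installation.

```python
import time


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    # The command is terminated by a newline, so flatten newlines in the output.
    clean = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, clean)


def submit_passive_result(command_file, line):
    """Write the command line to the external command file.

    command_file is an assumption: pass whatever path your nagios.cfg
    sets for the command_file directive.
    """
    with open(command_file, "w") as fh:
        fh.write(line)
```

A resubmission script would call format_passive_result() with the last known state of the link (2 for CRITICAL) and hand the result to submit_passive_result(). It needs some guard against overwriting a genuine recovery, which is exactly the weakness described above.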
The reason is that notifications in Nagios are driven by alerts: if no alerts come in, it won’t notify anyone. When defining escalations, you are defining thresholds, not timers. So when an alert comes in, Nagios compares it against the threshold for the given escalation; nothing fires on its own just because time has passed.
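To make that concrete, here is a deliberately simplified model of the behaviour, not Nagios's actual code. It assumes every incoming result reports the same non-OK hard state, and that the notification logic only runs when a result is processed. The function name is made up for illustration.

```python
def notification_times(result_times, notification_interval):
    """Return the result times (in minutes) at which a notification goes out.

    Simplified model: a notification is considered only when a check
    result arrives, and is sent if none has been sent yet or if at least
    notification_interval minutes have passed since the last one.
    """
    sent = []
    for t in sorted(result_times):
        if not sent or t - sent[-1] >= notification_interval:
            sent.append(t)
    return sent


# One trap at t=0 and nothing afterwards (the passive case in this thread).
single_trap = notification_times([0], 20)

# An actively checked service that returns a result every 2 minutes for an hour.
polled = notification_times(range(0, 61, 2), 20)

print(single_trap)  # [0]
print(polled)       # [0, 20, 40, 60]
```

In this model the single trap produces exactly one notification, so an escalation that only applies from a later notification onwards is never reached. The polled service keeps feeding results in, which is what lets the 20-minute notification_interval take effect.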
The way I got around this (and I don’t really like it) is to have my OCSP command write the current state of the service to a file, say /usr/nagios/ocsp/hostname-service_name.txt. The command that runs when the freshness check fires for these types of services then just outputs that file (which contains the plugin output and status code from the time it was last checked). If no new data has come in, the OCSP command just regurgitates this output back to the same file, and the process continues until some new data is sent in to the service passively.
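A minimal sketch of the two halves of that hack might look like the following. The function names and the two-line file layout (status code on the first line, plugin output on the second) are my assumptions for illustration; only the hostname-service_name.txt naming comes from the description above.

```python
import os


def state_file(directory, host, service):
    # Mirrors the hostname-service_name.txt naming described above.
    return os.path.join(directory, "%s-%s.txt" % (host, service))


def save_state(directory, host, service, return_code, output):
    """OCSP side: record the latest status code and plugin output."""
    with open(state_file(directory, host, service), "w") as fh:
        fh.write("%d\n%s\n" % (return_code, output))


def replay_state(directory, host, service):
    """Freshness side: return (return_code, output) from the saved file.

    Falls back to UNKNOWN (3) if the file is missing or unreadable.
    """
    try:
        with open(state_file(directory, host, service)) as fh:
            code = int(fh.readline().strip())
            output = fh.readline().rstrip("\n")
    except (OSError, ValueError):
        return 3, "UNKNOWN: no saved state for %s/%s" % (host, service)
    return code, output
```

A wrapper used as the service's check command would call replay_state(), print the output, and exit with the returned code, so Nagios sees the last known state again and runs its notification logic on it.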
We too utilise a similar hack to work around this issue. But don’t you think this is an area in Nagios which requires some development?
Typically, a device issues one trap upon a change in status; therefore, in a monitoring chain, shouldn’t it really be the job of Nagios to let us manage this trap as much as any other service check result?
As I have indicated, my organisation mainly uses SNMP traps to monitor the link status of our STPs etc., and in the event of a sig. link or port failing, we definitely require more than one alert to be despatched.