Passive checks only trigger one alert notification


#1

Hello,
In order to monitor the status of a link (up or down), we use passive service checks which receive traps and generate the appropriate alert. However, we have found that only one notification is raised when a link goes down, so our escalations do not work.

We don’t know whether this is due to a bug or our own misconfiguration, but any input you could offer would be greatly appreciated. We are currently running Nagios v2.9, but we also observed this issue under v2.5 and v2.6. Our escalation configuration is noted below for your information:

define serviceescalation {
    host_name
    service_description
    first_notification      1
    last_notification       0
    contact_groups          admins
    notification_interval   20
    escalation_period       workhours
}

define serviceescalation {
    host_name
    service_description
    first_notification      1
    last_notification       0
    contact_groups          support_primary
    notification_interval   20
}

define serviceescalation {
    host_name
    service_description
    first_notification      2
    last_notification       0
    contact_groups          support_shadow
    notification_interval   20
}

Thanks!


#2

What’s your notification_interval set to in your service definitions? I hope it’s not 0.

Also, what are max_check_attempts, normal_check_interval, and retry_check_interval set to in your service definitions?


#3

I am experiencing the same issue, and it is limited to passively checked items only. Escalations for actively checked items work as intended.


#4

Hi,
Apologies for the late response, but I have been on holiday :wink:

Our service definitions are configured as follows:

obsess_over_service 1
check_freshness 0
notifications_enabled 1
check_period 24x7
normal_check_interval 2
retry_check_interval 1
notification_interval 20
notification_period 24x7    ; 24x7 is defined within timeperiods.cfg
notification_options w,u,c,r
register 0

In order to work around this issue, we have had to develop a script which manually resubmits a trap to Nagios if a recovery has not been detected. This is far from ideal, as we often find that recovery notifications are clobbered by our own resubmissions.
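For what it’s worth, the core of such a resubmission can be quite small if you inject the result through the Nagios external command file rather than re-sending a real trap. A minimal sketch: the command-file path and the host/service names below are assumptions, not the previous poster’s actual setup:

```shell
#!/bin/sh
# Format a PROCESS_SERVICE_CHECK_RESULT line for the Nagios external
# command file. Args: host, service description, return code (0=OK,
# 1=WARNING, 2=CRITICAL, 3=UNKNOWN), plugin output.
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

# Common command-file location; adjust to your installation.
CMD_FILE=/usr/local/nagios/var/rw/nagios.cmd

# Re-submit a CRITICAL result for a link that has not recovered yet
# (host and service names here are illustrative):
if [ -p "$CMD_FILE" ]; then
    format_passive_result stp01 "Link Status" 2 "CRITICAL - link down" >> "$CMD_FILE"
fi
```

A cron job built around this would re-inject the last known state on whatever interval you want re-notification to happen.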


#5

I’ve seen this before (and have the same problem)

The reason is that the notification logic in Nagios is driven by check results – if no alerts come in, it won’t notify anyone. When defining escalations, you are defining thresholds, not timers. So when an alert comes in, Nagios compares its notification number against the threshold for the given escalation; nothing fires on a timer by itself.

The way I got around this (and I don’t really like it) is that my OCSP command writes the current state of the service to a file – say /usr/nagios/ocsp/hostname-service_name.txt. The freshness check command for these types of services then just outputs that file (which contains the plugin output and status code from the time the service was last checked). If no new data has come in, the OCSP command just regurgitates the same state back to the file, and the process continues until some new data is sent in to the service passively.

It’s a hack, but it works.
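For anyone wanting to try the same trick, the moving parts look roughly like this in object-config terms. This is a minimal sketch, not the previous poster’s actual setup: the freshness threshold and service names are assumptions, and the replay_last_state command is a hypothetical helper that must print the saved output and exit with the saved status code:

```
define service {
    host_name               stp01
    service_description     Link Status
    active_checks_enabled   0      ; results normally arrive passively via traps
    passive_checks_enabled  1
    check_freshness         1      ; run check_command when no passive result arrives
    freshness_threshold     1200   ; seconds of silence before the replay kicks in
    check_command           replay_last_state
    max_check_attempts      1
    check_period            24x7
    notification_interval   20
    notification_period     24x7
    notification_options    w,u,c,r
    contact_groups          admins
}

define command {
    command_name    replay_last_state
    ; Hypothetical script: outputs /usr/nagios/ocsp/$HOSTNAME$-$SERVICEDESC$.txt
    ; and exits with the status code saved alongside it.
    command_line    /usr/nagios/libexec/replay_last_state $HOSTNAME$ $SERVICEDESC$
}
```

Note that the service definitions quoted earlier in this thread have check_freshness 0, so the freshness machinery never fires there; turning it on is what makes this loop run.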


#6

We too utilise a similar hack to work around this issue. But don’t you think this is an area of Nagios which requires some development?

Typically, only one trap is issued by a device upon a change in status; in a monitoring chain, shouldn’t it therefore be the job of Nagios to give us the functionality to manage that trap just as we would any other service check result?

As I have indicated, my organisation mainly uses SNMP traps to monitor the link status of our STPs etc., and in the event of a signalling link or port failing, we definitely require more than one alert to be despatched.


#7

Does anyone know whether this is fixed in Nagios version 3?


#8

Thanks, everyone, for answering. :slight_smile:

It’s the same in version 3. It seems to be a feature…