Problem with Recovery Notices

sbader · August 10, 2009, 8:50pm

Hi Group,

I posted this to nagios-users, but I’m not having much luck. Anyone have any ideas on this?

I’m experiencing a problem with Nagios not sending recovery alerts to
groups of users who were notified via escalation of a problem. In the
example listed below, if a service is critical for more than an hour,
an escalation is established to send a page. In this case, the page
was sent after 60 minutes of the service being in the critical state,
but when the service recovered, a recovery page was not sent. I’ve
included the relevant configuration entries below, and also a log
from an event which occurred earlier today and did not result in a
recovery page being sent. I am running Nagios version 3.0.6 on FreeBSD
7.2.

Here is the service definition:

# Service definition check_local_procs

define service{
        use                     generic-service
        host_name               NETMGT
        service_description     PROCS
        is_volatile             0
        check_period            24x7
        max_check_attempts      2
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          NETWORK-TEAM
        notification_interval   5
        notification_period     24x7
        notification_options    c,r
        check_command           check_local_procs1!175!190
        }

The contact group NETWORK-TEAM sends an e-mail to all of the members
of our networking team.

Here is the relevant escalation for this service (and all services,
actually):

# Send a page after 60 minutes during non work hours if a service is down
# VPN-SITES group is excluded from paging during non work hours

define serviceescalation{
        hostgroup_name          !VPN-SITES, .*
        service_description     .*
        first_notification      12
        last_notification       12
        contact_groups          NETWORK-TEAM,NETWORK-TEAM-SNPP
        escalation_period       nonworkhours
        notification_interval   5
        }

The NETWORK-TEAM-SNPP group sends alphanumeric pages to our network
group. The intention of this escalation is to send a single page to
the pagers alerting us to a problem. We don’t want to get spammed with
pages, and a single page is sufficient. However, with this escalation,
when the service recovers, we are only getting notified of the
recovery to the NETWORK-TEAM contact, and the NETWORK-TEAM-SNPP
contact is NOT being notified of the recovery.
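
As a rough check on the timing, the escalation window can be worked out from the notification number and the interval. This is only a sketch: it assumes notifications go out at a fixed interval counted from the first problem notification, and the function name is made up for illustration.

```python
def minutes_after_first_notification(notification_number, interval_minutes):
    """Minutes between notification #1 and the given notification number,
    assuming notifications are sent at a fixed interval (an assumption)."""
    if notification_number < 1:
        raise ValueError("notification numbers start at 1")
    return (notification_number - 1) * interval_minutes


# Values taken from the service definition and escalation above
interval = 5            # notification_interval
first_escalated = 12    # first_notification
last_escalated = 12     # last_notification

offset = minutes_after_first_notification(first_escalated, interval)
window = last_escalated - first_escalated + 1  # how many notification numbers match

print(offset, window)  # prints: 55 1
```

Under that assumption the page goes out on notification 12, roughly 55 minutes after the first e-mail, and exactly one notification number falls inside the escalation.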

In the log below, you can see at 13:56 that the notifications were
escalated to our pagers (via the command notify-by-ipn), however, at
14:13 when the service recovered, we were only notified of the
recovery via e-mail.

[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes

[08-08-2009 14:06:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 344 processes
[08-08-2009 14:06:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 344 processes
[08-08-2009 14:06:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 344 processes

[08-08-2009 14:01:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 348 processes
[08-08-2009 14:01:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 348 processes
[08-08-2009 14:01:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 348 processes

[08-08-2009 13:56:15] SERVICE NOTIFICATION: tech3-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech2-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes

If you need more parts of the configuration, please let me know. I’m
not sure why we aren’t being notified of the recovery via our pagers,
because Nagios is supposed to send a recovery notification to everyone
who was notified of the problem. Is there something wrong with my
escalation recovery configuration or my understanding of escalations?
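
For anyone checking their own logs for the same symptom, here is a small sketch that lists contacts who received a problem notification but no matching OK notification. It assumes each SERVICE NOTIFICATION entry sits on one line with semicolon-separated fields in the order seen above (contact, host, service, state, command, output), and that the pager contacts are named like `tech1-ipn` once the wrapped lines are joined. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Three entries from the log above, with wrapped lines joined (assumed layout)
LOG = """\
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
"""

LINE_RE = re.compile(r"^\[[^\]]+\] SERVICE NOTIFICATION: (?P<fields>.*)$")


def contacts_by_state(log_text):
    """Map (host, service, state) to the set of contacts notified."""
    notified = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not a service notification entry
        parts = match.group("fields").split(";", 5)
        if len(parts) < 6:
            continue  # malformed or still-wrapped line
        contact, host, service, state = parts[:4]
        notified[(host, service, state)].add(contact)
    return notified


def missing_recoveries(log_text):
    """Contacts notified of a problem state but never of OK, per service."""
    notified = contacts_by_state(log_text)
    missing = {}
    for (host, service, state), contacts in notified.items():
        if state == "OK":
            continue
        recovered = notified.get((host, service, "OK"), set())
        gap = contacts - recovered
        if gap:
            missing.setdefault((host, service), set()).update(gap)
    return missing


print(missing_recoveries(LOG))  # {('NETMGT', 'PROCS'): {'tech1-ipn'}}
```

The sketch ignores time ordering, so it only makes sense on a log slice that covers a single outage.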

Thanks in advance!

-Steve

luca · August 11, 2009, 11:54am

Is a notification option (e.g. r) missing for the group that is not receiving the recovery message?
Does the Nagios notifications page report a recovery notification for this group?

sbader · August 12, 2009, 6:03am

Thanks for the response. All of my contacts in the NETWORK-TEAM-SNPP group have service_notification_options c,r,u set. In the notifications log in Nagios, I see the initial escalation, and I do see it dispatch a recovery via e-mail to NETWORK-TEAM, but it never sends recovery notifications to the contacts it escalated to using the NETWORK-TEAM-SNPP group.

The odd part is that host recovery notifications, including escalations, are working just fine, and I have them set up exactly the same way.

Thanks,

-Steve

sbader · August 13, 2009, 3:17pm

I found the solution to this problem. The issue is that my escalation only involved a single notification. If a host went down, a page got sent out, and if it recovered on the next notification interval, the recovery page got sent as well. If it recovered on any other notification interval, the escalation did not apply.

I added the following to my configuration and it is working as expected now. Basically, after the first page is sent out, I allow any recovery notification to get sent to the pagers.

# Send a page for a recovery during work hours

define hostescalation{
        host_name               .*
        first_notification      2
        last_notification       0
        contact_groups          NETWORK-TEAM,NETWORK-TEAM-SNPP
        escalation_period       workhours
        notification_interval   5
        escalation_options      r
        }
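
To illustrate why the open-ended range helps, here is a minimal sketch of the range check as described in this thread. It is not Nagios source code: `escalation_matches` is a made-up name, the state letters follow the `escalation_options` style, and how Nagios numbers a recovery internally is not modeled. It relies on two documented behaviors: `last_notification 0` means the escalation never expires, and an escalation with no `escalation_options` is valid for all states.

```python
def escalation_matches(first, last, notification_number, state, options=None):
    """Return True if an escalation covers this notification.

    first/last mirror first_notification/last_notification;
    last == 0 means the range never ends. options is a set of state
    letters (e.g. {"r"}), or None for "all states".
    """
    if notification_number < first:
        return False
    if last != 0 and notification_number > last:
        return False
    if options is not None and state not in options:
        return False
    return True


# Single-notification escalation (first 12, last 12): a recovery tied to
# any other notification number (13 is a hypothetical example) is missed.
print(escalation_matches(12, 12, 12, "c"))        # True
print(escalation_matches(12, 12, 13, "r"))        # False

# Open-ended, recovery-only escalation (first 2, last 0, options r):
print(escalation_matches(2, 0, 13, "r", {"r"}))   # True
print(escalation_matches(2, 0, 13, "c", {"r"}))   # False
```

Because the second escalation only matches recoveries, it should not add extra problem pages on top of the single page from the original escalation.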

Hope this helps someone in the future.

Thanks.

-Steve

luca · August 15, 2009, 7:56pm

Thanks for sharing.

Wouldn’t it be enough to explicitly set the option “r” in the escalation definition?