Hi Group,
I posted this to nagios-users, but I’m not having much luck. Anyone have any ideas on this?
I’m experiencing a problem with Nagios not sending recovery alerts to
groups of users who were notified via escalation of a problem. In the
example listed below, if a service is critical for more than an hour,
an escalation is established to send a page. In this case, the page
was sent after 60 minutes of the service being in the critical state,
but when the service recovered, a recovery page was not sent. I’ve
included the relevant configuration entries below, and also a log
from an event which occurred earlier today and did not result in a
recovery page being sent. I am running Nagios version 3.0.6 on FreeBSD
7.2.
Here is the service definition:
# Service definition check_local_procs
define service{
    use                     generic-service
    host_name               NETMGT
    service_description     PROCS
    is_volatile             0
    check_period            24x7
    max_check_attempts      2
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          NETWORK-TEAM
    notification_interval   5
    notification_period     24x7
    notification_options    c,r
    check_command           check_local_procs1!175!190
}
The contact group NETWORK-TEAM sends an e-mail to all of the members
of our networking team.
Here is the relevant escalation for this service (and all services,
actually):
# Send a page after 60 minutes during non work hours if a service is
# down
# VPN-SITES group is excluded from paging during non work hours
define serviceescalation{
    hostgroup_name          !VPN-SITES, .*
    service_description     .*
    first_notification      12
    last_notification       12
    contact_groups          NETWORK-TEAM,NETWORK-TEAM-SNPP
    escalation_period       nonworkhours
    notification_interval   5
}
The NETWORK-TEAM-SNPP group sends alphanumeric pages to our network
group. The intention of this escalation is to send a single page to
the pagers alerting us to a problem. We don’t want to get spammed with
pages, and a single page is sufficient. However, with this escalation,
when the service recovers, we are only getting notified of the
recovery to the NETWORK-TEAM contact, and the NETWORK-TEAM-SNPP
contact is NOT being notified of the recovery.
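To make the timing concrete, here is a rough Python sketch of the arithmetic as I understand it. It assumes notifications go out at a fixed notification_interval of 5 minutes after the first one, and that an escalation applies to a problem notification only when its number falls between first_notification and last_notification. The first-notification time (13:01) is back-computed from the 13:56 page in the log, and the function names are just for illustration. It says nothing about how recoveries are matched, which is the part I'm unsure about.

```python
from datetime import datetime, timedelta


def notification_times(first_sent, interval_minutes, count):
    """Send times of notifications 1..count, assuming a fixed interval."""
    step = timedelta(minutes=interval_minutes)
    return [first_sent + i * step for i in range(count)]


def in_escalation_range(number, first_notification, last_notification):
    """Range check for a problem notification (assumed behaviour)."""
    return first_notification <= number <= last_notification


# Assumed time of notification #1, back-computed from the log (seconds dropped).
first = datetime(2009, 8, 8, 13, 1)
times = notification_times(first, interval_minutes=5, count=14)

for number, sent in enumerate(times, start=1):
    if number >= 12:
        paged = in_escalation_range(number, 12, 12)
        print(number, sent.strftime("%H:%M"), "paged" if paged else "e-mail only")
```

Under those assumptions, notification 12 lands at 13:56 and is the only one inside the 12-12 range, while notifications 13 and 14 (14:01 and 14:06) fall outside it, which matches the log below.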
In the log below (newest entries first), you can see at 13:56 that the
notifications were escalated to our pagers (via the command
notify-by-ipn). However, at 14:13 when the service recovered, we were
only notified of the recovery via e-mail.
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 14:06:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 344 processes
[08-08-2009 14:06:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 344 processes
[08-08-2009 14:06:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 344 processes
[08-08-2009 14:01:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 348 processes
[08-08-2009 14:01:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 348 processes
[08-08-2009 14:01:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 348 processes
[08-08-2009 13:56:15] SERVICE NOTIFICATION: tech3-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech2-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
If you need more parts of the configuration, please let me know. I’m
not sure why we aren’t being notified of the recovery via our pagers,
because my understanding is that Nagios sends a recovery notification
to everyone who was notified of the problem. Is there something wrong
with my escalation configuration or my understanding of escalations?
Thanks in advance!
-Steve