Nagios 3.x host up notifications not working when host has b


#1

Guys

I’m testing Nagios 3.0a, and it looks like the host notification cycle is broken.

I have the default generic contact template :

define contact{
name generic-contact ; The name of this contact template
service_notification_period 24x7 ; service notifications can be sent anytime
host_notification_period 24x7 ; host notifications can be sent anytime
service_notification_options w,u,c,r,f,s ; send notifications for all service states, flapping events, and scheduled downtime events
host_notification_options d,u,r,f,s ; send notifications for all host states, flapping events, and scheduled downtime events
service_notification_commands notify-service-by-email ; send service notifications via email
host_notification_commands notify-host-by-email ; send host notifications via email
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
}

As you can see, host notifications are enabled for all states.

Then I have myself as a contact :

define contact{
contact_name astuck
use generic-contact
alias SysAdmin1
email {my email}
}

Then I added myself to a contact group :

define contactgroup{
contactgroup_name admins
alias SysAdmins
members astuck
}

Then I added 1 more host template to the already existing 2 :

define host{
name generic-host ; The name of this host template
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_period 24x7 ; Send host notifications at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}

define host{
name generic-linux
use generic-host
check_period 24x7
check_interval 5
retry_interval 1
max_check_attempts 10
check_command check-host-alive
notification_interval 120
notification_options d,u,r
register 0
}

define host{
name nonprod
use generic-linux
contact_groups admins
register 0
}

Now I define a host :

define host{
use nonprod
host_name lithium
alias Oracle Dev 2
address lithium
}

This host inherits from the ‘nonprod’ template, which in turn gets its settings from generic-linux, which gets its settings from generic-host.
The generic-linux template sets the notification options to d,u,r.
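
To make the inheritance chain concrete, here is a minimal Python sketch that resolves the `use` directives for lithium. It is only an illustration under simplifying assumptions: the `OBJECTS` dictionary and the `resolve` helper are hypothetical names, only a few directives from the definitions above are copied in, and it handles single inheritance only (it ignores `register` and any additive inheritance Nagios itself may apply).

```python
# Templates and host from the post, reduced to the directives that matter here.
# OBJECTS and resolve() are illustrative names, not part of Nagios.
OBJECTS = {
    "generic-host": {"notifications_enabled": "1", "notification_period": "24x7"},
    "generic-linux": {
        "use": "generic-host",
        "notification_interval": "120",
        "notification_options": "d,u,r",
    },
    "nonprod": {"use": "generic-linux", "contact_groups": "admins"},
    "lithium": {"use": "nonprod", "host_name": "lithium"},
}


def resolve(name, objects=OBJECTS):
    """Return the effective directives; local values override inherited ones."""
    obj = objects[name]
    parent = obj.get("use")
    effective = resolve(parent, objects) if parent else {}
    effective.update({k: v for k, v in obj.items() if k != "use"})
    return effective


effective = resolve("lithium")
print(effective["notification_options"])  # recovery ('r') is inherited from generic-linux
print(effective["contact_groups"])
```

Under these assumptions, lithium should end up with recovery notifications enabled and the admins contact group attached, so the configuration itself does not explain a missing UP notification.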

When I take down host lithium for a short time everything works :

I turned on notification debugging and this is what I get :

host down :

[1181781555.111892:032.0] ** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 1, Last Notification: Wed Dec 31 16:00:00 1969
[1181781555.111914:032.0] Notification viability test passed.
[1181781555.111922:032.1] Current notification number: 1
[1181781555.111928:032.2] Creating list of contacts to be notified.
[1181781555.111934:032.1] Host notification will NOT be escalated.
[1181781555.111940:032.2] Adding contact ‘astuck’ to notification list.
[1181781555.111994:032.2] ** Attempting to notifying contact ‘astuck’…
[1181781555.112002:032.2] ** Checking host notification viability for contact ‘astuck’…
[1181781555.112013:032.2] ** Host notification viability for contact ‘astuck’ PASSED.
[1181781555.112020:032.2] ** Notifying contact ‘astuck’
[1181781555.112075:032.2] Raw Command: /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
[1181781555.112082:032.2] Processed Command: /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: lithium\nState: DOWN\nAddress: lithium\nInfo: (No output returned from host check)\n\nDate/Time: Wed Jun 13 17:39:15 PDT 2007\n" | /bin/mail -s "** PROBLEM Host Alert: lithium is DOWN **" {my email}
[1181781555.155552:032.0] No contacts were notified. Next possible notification time: Wed Jun 13 19:39:15 2007
[1181781555.155583:032.0] 1 contacts were notified.
[1181781555.156008:032.0] ** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 1, Last Notification: Wed Dec 31 16:00:00 1969
[1181781555.156023:032.0] Notification viability test passed.
[1181781555.156030:032.1] Current notification number: 2
[1181781555.156036:032.2] Creating list of contacts to be notified.
[1181781555.156042:032.1] Host notification will NOT be escalated.
[1181781555.156048:032.2] Adding contact ‘astuck’ to notification list.
[1181781555.156110:032.2] ** Attempting to notifying contact ‘astuck’…
[1181781555.156117:032.2] ** Checking host notification viability for contact ‘astuck’…
[1181781555.156129:032.2] ** Host notification viability for contact ‘astuck’ PASSED.
[1181781555.156136:032.2] ** Notifying contact ‘astuck’
[1181781555.156189:032.2] Raw Command: /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
[1181781555.156196:032.2] Processed Command: /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: lithium\nState: DOWN\nAddress: lithium\nInfo: (Host Check Timed Out)\n\nDate/Time: Wed Jun 13 17:39:15 PDT 2007\n" | /bin/mail -s "** PROBLEM Host Alert: lithium is DOWN **" {my email}
[1181781555.209723:032.0] No contacts were notified. Next possible notification time: Wed Jun 13 19:39:15 2007
[1181781555.209754:032.0] 1 contacts were notified.

So far so good.
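
The "Next possible notification time" in the log is consistent with notification_interval 120 from the generic-linux template. A minimal Python sketch of that arithmetic follows. It assumes the default interval_length of 60 seconds per interval unit and a fixed UTC-7 offset for PDT; `next_notification` is a hypothetical helper, not a Nagios function.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log timestamps are in PDT, modelled here as a fixed UTC-7 offset.
PDT = timezone(timedelta(hours=-7))

# Assumption: interval_length is left at 60 seconds per interval unit.
INTERVAL_LENGTH = 60


def next_notification(last_epoch, notification_interval, interval_length=INTERVAL_LENGTH):
    """Earliest time a repeat problem notification could go out."""
    return datetime.fromtimestamp(
        last_epoch + notification_interval * interval_length, tz=PDT
    )


# 1181781555 is the epoch second of the first DOWN notification in the log above.
nxt = next_notification(1181781555, 120)
print(nxt.strftime("%Y-%m-%d %H:%M:%S"))  # 2007-06-13 19:39:15
```

That matches the 19:39:15 shown in the debug output, so the re-notification throttle for the DOWN state appears to behave as configured.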

host comes back :

[1181781575.173652:032.1] Its not yet time to re-notify the contacts about this host problem…
[1181781575.173661:032.1] Next acceptable notification time: Wed Jun 13 19:39:15 2007
[1181781575.173685:032.0] Notification viability test failed. No notification will be sent out.
[1181781575.174047:032.0] ** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 1, Last Notification: Wed Jun 13 17:39:15 2007
[1181781575.174061:032.1] Its not yet time to re-notify the contacts about this host problem…
[1181781575.174070:032.1] Next acceptable notification time: Wed Jun 13 19:39:15 2007
[1181781575.174077:032.0] Notification viability test failed. No notification will be sent out.
[1181781585.035292:032.0] ** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 1, Last Notification: Wed Jun 13 17:39:15 2007
[1181781585.035319:032.1] Its not yet time to re-notify the contacts about this host problem…
[1181781585.035328:032.1] Next acceptable notification time: Wed Jun 13 19:39:15 2007
[1181781585.035335:032.0] Notification viability test failed. No notification will be sent out.
[1181781615.083677:032.0] ** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 1, Last Notification: Wed Jun 13 17:39:15 2007
[1181781615.083708:032.1] Its not yet time to re-notify the contacts about this host problem…
[1181781615.083717:032.1] Next acceptable notification time: Wed Jun 13 19:39:15 2007
[1181781615.083724:032.0] Notification viability test failed. No notification will be sent out.
[1181781615.084112:032.0] ** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 1, Last Notification: Wed Jun 13 17:39:15 2007
[1181781615.084125:032.1] Its not yet time to re-notify the contacts about this host problem…
[1181781615.084135:032.1] Next acceptable notification time: Wed Jun 13 19:39:15 2007
[1181781615.084141:032.0] Notification viability test failed. No notification will be sent out.
[1181781625.246191:032.0] ** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 1, Last Notification: Wed Jun 13 17:39:15 2007
[1181781625.246218:032.1] Its not yet time to re-notify the contacts about this host problem…
[1181781625.246228:032.1] Next acceptable notification time: Wed Jun 13 19:39:15 2007
[1181781625.246234:032.0] Notification viability test failed. No notification will be sent out.
[1181781629.273122:032.0] ** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969
[1181781629.273153:032.0] Notification viability test passed.
[1181781629.273161:032.1] Current notification number: 3
[1181781629.273167:032.2] Creating list of contacts to be notified.
[1181781629.273173:032.1] Host notification will NOT be escalated.
[1181781629.273179:032.2] Adding contact ‘astuck’ to notification list.
[1181781629.273241:032.2] ** Attempting to notifying contact ‘astuck’…
[1181781629.273248:032.2] ** Checking host notification viability for contact ‘astuck’…
[1181781629.273262:032.2] ** Host notification viability for contact ‘astuck’ PASSED.
[1181781629.273269:032.2] ** Notifying contact ‘astuck’
[1181781629.273324:032.2] Raw Command: /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
[1181781629.273331:032.2] Processed Command: /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: RECOVERY\nHost: lithium\nState: UP\nAddress: lithium\nInfo: PING OK - Packet loss = 0%, RTA = 0.18 ms\n\nDate/Time: Wed Jun 13 17:40:29 PDT 2007\n" | /bin/mail -s "** RECOVERY Host Alert: lithium is UP **" {my email}
[1181781629.346436:032.0] No contacts were notified. Next possible notification time: Wed Jun 13 19:40:29 2007
[1181781629.346465:032.0] 1 contacts were notified.

So I got 2 DOWN and 1 UP notifications.
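
For anyone who wants to reproduce that count from their own debug file, here is a rough Python sketch that tallies host notification attempts that passed the viability test, grouped by state. The `tally_sent` helper and the `SAMPLE` lines are illustrative: the sample is abridged from the log above (timestamps dropped, straight quotes), and the state mapping (1 = DOWN, 0 = UP) is taken from what the log itself shows.

```python
import re
from collections import Counter

# Matches the host attempt lines seen in the debug log and captures the state.
ATTEMPT = re.compile(r"\*\* Host Notification Attempt \*\*.*Current State: (\d+)")

# State numbers as they appear in the log above; anything else is kept as a number.
STATE_NAMES = {0: "UP", 1: "DOWN"}


def tally_sent(lines):
    """Count host notification attempts that passed the viability test, by state."""
    counts = Counter()
    pending = None  # state of the most recent host attempt still awaiting a verdict
    for line in lines:
        match = ATTEMPT.search(line)
        if match:
            pending = int(match.group(1))
        elif "Service Notification Attempt" in line:
            pending = None  # verdicts that follow belong to a service, not the host
        elif "Notification viability test passed" in line and pending is not None:
            counts[STATE_NAMES.get(pending, str(pending))] += 1
            pending = None
        elif "Notification viability test failed" in line:
            pending = None
    return counts


# Abridged sample in the shape of the short-outage log above.
SAMPLE = [
    "** Host Notification Attempt ** Host: 'lithium', Type: 0, Current State: 1",
    "Notification viability test passed.",
    "** Host Notification Attempt ** Host: 'lithium', Type: 0, Current State: 1",
    "Notification viability test passed.",
    "** Host Notification Attempt ** Host: 'lithium', Type: 0, Current State: 1",
    "Its not yet time to re-notify the contacts about this host problem...",
    "Notification viability test failed. No notification will be sent out.",
    "** Host Notification Attempt ** Host: 'lithium', Type: 0, Current State: 0",
    "Notification viability test passed.",
]

print(dict(tally_sent(SAMPLE)))
```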

HOWEVER, if the host is down a little longer I NEVER get a host UP notification!

If I wait for 4 host DOWN notifications before I bring the box back, I don’t get the host UP notification, and according to the log that’s because Nagios never attempts to send one.
Instead, Nagios starts checking services again when it shouldn’t, since the host is still in the DOWN state. This is where I think the problem lies: since Nagios is now checking the services and they are actually available again, it simply assumes that the host must be available as well, hence no host UP notification :

[1181782535.056952:032.0] ** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 1, Last Notification: Wed Jun 13 17:54:45 2007
[1181782535.056985:032.1] Its not yet time to re-notify the contacts about this host problem…
[1181782535.056995:032.1] Next acceptable notification time: Wed Jun 13 19:54:45 2007
[1181782535.057002:032.0] Notification viability test failed. No notification will be sent out.
[1181782575.250861:032.0] ** Service Notification Attempt ** Host: ‘lithium’, Service: ‘DISK USAGE /boot’, Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969
[1181782575.250885:032.1] We shouldn’t notify about this recovery.
[1181782575.250892:032.0] Notification viability test failed. No notification will be sent out.
[1181782585.236571:032.0] ** Service Notification Attempt ** Host: ‘lithium’, Service: ‘MEM’, Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969
[1181782585.236601:032.1] We shouldn’t notify about this recovery.
[1181782585.236609:032.0] Notification viability test failed. No notification will be sent out.
[1181782595.188430:032.0] ** Service Notification Attempt ** Host: ‘lithium’, Service: ‘DISK USAGE /home’, Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969

Instead of doing a ‘** Host Notification Attempt ** Host: ‘lithium’, Type: 0, Current State: 0’ like it did before, it starts with a service check for /boot :

** Service Notification Attempt ** Host: ‘lithium’, Service: ‘DISK USAGE /boot’

That clearly doesn’t deserve a recovery notification, because we didn’t get a warning/critical notification in the first place, and that’s OK, but the host UP notification is missing.
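
The difference between the two runs can be checked mechanically: does a host notification attempt with Current State: 0 ever show up after the DOWN attempts? Below is a small Python sketch of that check. `has_recovery_attempt`, `SHORT_OUTAGE` and `LONG_OUTAGE` are hypothetical names, and the sample lines are abridged from the two logs in this post (timestamps dropped, straight quotes).

```python
import re

# Only host attempts count; service attempts also carry "Current State" but must be ignored.
HOST_ATTEMPT = re.compile(r"\*\* Host Notification Attempt \*\*.*Current State: (\d+)")


def has_recovery_attempt(lines):
    """True if a host attempt in state 0 (UP) follows one in state 1 (DOWN)."""
    seen_down = False
    for line in lines:
        match = HOST_ATTEMPT.search(line)
        if not match:
            continue
        state = int(match.group(1))
        if state == 1:
            seen_down = True
        elif state == 0 and seen_down:
            return True
    return False


# Abridged from the short-outage log: the UP attempt is present.
SHORT_OUTAGE = [
    "** Host Notification Attempt ** Host: 'lithium', Type: 0, Current State: 1",
    "** Host Notification Attempt ** Host: 'lithium', Type: 0, Current State: 0",
]

# Abridged from the long-outage log: only service attempts follow the DOWN attempt.
LONG_OUTAGE = [
    "** Host Notification Attempt ** Host: 'lithium', Type: 0, Current State: 1",
    "** Service Notification Attempt ** Host: 'lithium', Service: 'DISK USAGE /boot', Type: 0, Current State: 0",
    "** Service Notification Attempt ** Host: 'lithium', Service: 'MEM', Type: 0, Current State: 0",
]

print(has_recovery_attempt(SHORT_OUTAGE), has_recovery_attempt(LONG_OUTAGE))
```

On the excerpts posted here, the short outage contains the recovery attempt and the long outage does not, which is the behaviour described above.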

Looks like a bug to me, since it works when the downtime is shorter.

Am I right ?