Host notification problems w/ Nagios 1.2


#1

Hi Everyone,
I’m having a little problem with Nagios sending notifications when it recognizes a host is down. To be more specific, I have host checks turned on and the host notification_interval set to 15 minutes. When a service goes down, the host check fires correctly and it notices the host is down. The first host down notification is sent as it should be, but no subsequent notifications are ever sent.
When the host comes back up, the recovery notifications go out, but the intervening notifications that should be coming out every 15 minutes or so never do.
(It is acting as if the host notification interval were set to 0, even though I know it is not.)
Has anyone ever run into this problem before?

Here are the relevant snippets from my config files:
(Host names and IPs removed to protect the innocent.)

Hosts.cfg

define host {
name GenericHost
register 0
checks_enabled 0
event_handler_enabled 1
notifications_enabled 1
notification_options d,u,r
notification_interval 15
notification_period 24x7
process_perf_data 0
retain_nonstatus_information 0
retain_status_information 1
flap_detection_enabled 0
max_check_attempts 1
check_command check-host-alive
}

define host {
use GenericHost
name PingableHost
register 0
checks_enabled 1
}

define host {
use GenericHost
host_name test.foo.com
alias test.foo.com
address x.x.x.x
}

Services.cfg

define service {
name GenericService
register 0
check_freshness 0
obsess_over_service 0
event_handler_enabled 1
is_volatile 0
passive_checks_enabled 0
notifications_enabled 1
notification_options w,c,u,r
notification_interval 15
notification_period 24x7
retain_status_information 1
normal_check_interval 5
check_period 24x7
retry_check_interval 1
active_checks_enabled 1
retain_nonstatus_information 0
process_perf_data 1
parallelize_check 1
max_check_attempts 3
contact_groups all
}

define service {
use GenericService
host_name test.foo.com
service_description SSH
check_command check_ssh!22
max_check_attempts 3
}

Any help anyone can give would be appreciated!
Chris


#2

Do contacts and contactgroups have the same notification intervals? Somewhere there is a limit. Try reading through the notification system description in the docs for other possible failure points.

Luca


#3

Hi Luca,
I have checked just about everything I can think of and turned almost every dial and switch I can find in an attempt to make this work. Here are my contacts and contact groups from the config. There is nothing special in them.

define contactgroup {
contactgroup_name all
alias All Contacts
members page-foo,email-foo
}
[2]root@mon01.readytechs.com:/home/nagios/exalt/etc> more contacts.cfg
define contact {
contact_name page-foo
alias page-foo
email page-foo@foo.com
pager page-foo@foo.com
service_notification_options c,u,r
service_notification_period 24x7
service_notification_commands notify-by-epager
host_notification_options d,u,r
host_notification_period 24x7
host_notification_commands host-notify-by-epager
}

define contact {
contact_name email-foo
alias email-foo
email monitor-foo@foo.com
pager monitor-foo@foo.com
service_notification_options w,c,u,r
service_notification_period 24x7
service_notification_commands notify-by-email
host_notification_options d,u,r
host_notification_period 24x7
host_notification_commands host-notify-by-email
}

I've been using Nagios for over 4 years and this is the first time I've noticed this issue. It might be a configuration problem on my end or it may be a bug that has always been there and I just never had hosts down often enough or long enough without being ack'd or downtimed to notice.

I appreciate any help or pointers anyone can give.
Thanks,
Chris


#4

I don’t think it’s a bug where notifications aren’t sent at all… too many people around have it working correctly :)

on your test.foo.com machine you have
checks_enabled 0
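
The point about checks_enabled can be checked against the posted config: test.foo.com has `use GenericHost`, not `use PingableHost`, so it picks up the template's `checks_enabled 0`. Below is a minimal Python sketch of that lookup. It is not Nagios code; the `templates` dictionary and the `resolve` helper are hypothetical names, and only the directives relevant here are copied from the thread. It assumes the usual rule that a local directive overrides an inherited one.

```python
# Hypothetical, simplified model of Nagios template inheritance ("use").
# Only the directives relevant to this thread are copied from the post.
templates = {
    "GenericHost": {"checks_enabled": "0", "notification_interval": "15"},
    "PingableHost": {"use": "GenericHost", "checks_enabled": "1"},
}

def resolve(definition, templates):
    """Return the effective directives after following the 'use' chain."""
    parent_name = definition.get("use")
    inherited = resolve(templates[parent_name], templates) if parent_name else {}
    # Local directives override anything inherited from the template.
    effective = {**inherited, **definition}
    effective.pop("use", None)
    return effective

# The host as posted: it inherits directly from GenericHost.
host = {"use": "GenericHost", "host_name": "test.foo.com"}
effective = resolve(host, templates)
print(effective["checks_enabled"])  # prints 0

# The same host pointed at PingableHost instead.
fixed = resolve({**host, "use": "PingableHost"}, templates)
print(fixed["checks_enabled"])  # prints 1
```

Whether this is what stops the repeat notifications is not established by the sketch; it only shows which value the host ends up with.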

There’s a lot of stuff in those config files I’ve never used, so I can’t say if there IS a problem or not…

Just try defining a host from scratch with a single service. Possibly checking the logs could help out too…

In the notifications page is there only a single notification being sent or possibly more than one? (this could be if a “stupid” spam filter is active)

Luca


#5

Is log_notifications=1 set in nagios.cfg? If so, you should be able to look at the notifications page, as Luca pointed out.

This bothers me: "The first host down notification is sent as it should but no subsequent notifications are ever sent."
Your notification_interval is set to 15, and I assume one interval is still 60 seconds (interval_length in nagios.cfg), or did you change that? Find out, because if you did, perhaps you aren’t waiting long enough.
If you didn’t, then are you waiting long enough? Does "Current Notification Number:" ever change on the Host State Information page?

As a last resort, try using the CGI page for Host State Information and click on the Host Command to disable notifications, and then when it changes, click enable notifications. Now try to down your device and see if you get notified more than once.


#6

I’m having a very similar issue. The host check script executes but then never sends notifications.

I’ve started troubleshooting by doing the following:

Add new host with basic options in the hosts.cfg file:

define host{
name generic-host
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
register 0
}

define host{
use generic-host
host_name joeg5
alias joeg5
address 10.0.0.151
check_command check-host-alive
max_check_attempts 10
notification_interval 1440
notification_period 24x7
notification_options d,u,r
}

Setup my new host in the services.cfg file:

define service{
name generic-service
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
notification_interval 1440
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
register 0
}

define service{
use generic-service
host_name joeg5
service_description PING
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups unixadmin-emails,unixadmin-pagers,itec
notification_period 24x7
notification_options c,r
check_command check_ping!300.0,20%!800.0,60%
}

I am in the unixadmin-emails, unixadmin-pagers, and itec groups. I have the following options for my contacts.cfg:

service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,c,r
host_notification_options d,r

For this system, when I shut it down, here’s what I see in my var/nagios.log:

[1142873178] HOST ALERT: joeg5;DOWN;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
[1142873188] HOST ALERT: joeg5;DOWN;SOFT;2;CRITICAL - Plugin timed out after 10 seconds
[1142873198] HOST ALERT: joeg5;DOWN;SOFT;3;CRITICAL - Plugin timed out after 10 seconds
[1142873208] HOST ALERT: joeg5;DOWN;SOFT;4;CRITICAL - Plugin timed out after 10 seconds
[1142873218] HOST ALERT: joeg5;DOWN;SOFT;5;CRITICAL - Plugin timed out after 10 seconds
[1142873228] HOST ALERT: joeg5;DOWN;SOFT;6;CRITICAL - Plugin timed out after 10 seconds
[1142873238] HOST ALERT: joeg5;DOWN;SOFT;7;CRITICAL - Plugin timed out after 10 seconds
[1142873248] HOST ALERT: joeg5;DOWN;SOFT;8;CRITICAL - Plugin timed out after 10 seconds
[1142873258] HOST ALERT: joeg5;DOWN;SOFT;9;CRITICAL - Plugin timed out after 10 seconds
[1142873268] HOST ALERT: joeg5;DOWN;HARD;10;CRITICAL - Plugin timed out after 10 seconds
[1142873268] SERVICE ALERT: joeg5;PING;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds

I would expect to see a HARD 2 and a HARD 3 for the SERVICE ALERT, or something like that for the HOST ALERT. After that, I would also expect to see a SERVICE NOTIFICATION entry.
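
One way to confirm what the log does and does not contain is to parse it instead of reading it by eye. The following is a rough Python sketch, assuming the `[timestamp] TYPE: field;field;...` layout seen in the entries quoted in this post; the `LOG` sample, the `ENTRY` pattern, and the `parse` helper are illustrative names, and only three of the posted lines are included to keep it short.

```python
import re
from datetime import datetime, timezone

# Three of the log lines quoted in this post.
LOG = """\
[1142873258] HOST ALERT: joeg5;DOWN;SOFT;9;CRITICAL - Plugin timed out after 10 seconds
[1142873268] HOST ALERT: joeg5;DOWN;HARD;10;CRITICAL - Plugin timed out after 10 seconds
[1142873268] SERVICE ALERT: joeg5;PING;CRITICAL;HARD;1;CRITICAL - Plugin timed out after 10 seconds
"""

# "[epoch] ENTRY TYPE: semicolon-separated fields"
ENTRY = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")

def parse(log_text):
    """Yield (utc_time, entry_type, fields) for each recognizable log line."""
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if match:
            stamp, kind, rest = match.groups()
            when = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
            yield when, kind, rest.split(";")

entries = list(parse(LOG))
# For HOST ALERT lines the third field is the state type (SOFT or HARD).
hard_host_down = [e for e in entries if e[1] == "HOST ALERT" and e[2][2] == "HARD"]
# Matches both HOST NOTIFICATION and SERVICE NOTIFICATION entries.
notifications = [e for e in entries if e[1].endswith("NOTIFICATION")]
print(len(hard_host_down), len(notifications))  # prints "1 0"
```

On this sample the host reaches a HARD DOWN state once and no notification entry of either kind follows, which matches the symptom described.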

Can anyone point me in the right direction? I am using Nagios 1.2.


#7

So you’re definitely getting at least one notification, mm? And you’re wondering why you’re not seeing any more? Well, I’d say you’ve definitely got a bit of waiting to do. Your notification interval is set to 1440. Now, unless you’ve modified your interval length setting in Nagios.cfg, that means you won’t receive a second notification until…calculates…24 hours after your first notification.
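
The arithmetic behind that estimate, as a small Python sketch. The function name `renotify_delay` is made up for illustration, and the 60-second default assumes interval_length in nagios.cfg has not been changed.

```python
from datetime import timedelta

def renotify_delay(notification_interval, interval_length=60):
    """Time between repeat notifications.

    notification_interval is counted in time units; interval_length
    (nagios.cfg) is the number of seconds per unit. 60 is assumed here.
    """
    return timedelta(seconds=notification_interval * interval_length)

print(renotify_delay(1440))  # prints "1 day, 0:00:00"
print(renotify_delay(15))    # prints "0:15:00"
```

So an interval of 1440 means one repeat per day, while the 15 used earlier in the thread means one every 15 minutes.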


#8

I never get even one notification. I think with your issue, you were getting at least one notification, correct?


#9

I should mention that when the service (I just had PING for testing purposes) comes out of its downed state, if the host has come out of its DOWNed state also, the SERVICE ALERT change triggers a SERVICE NOTIFICATION correctly as it should be.

[1142883196] SERVICE ALERT: joeg5;PING;OK;HARD;3;PING OK - Packet loss = 0%, RTA = 0.82 ms
[1142883197] SERVICE NOTIFICATION: hootjr-email;joeg5;PING;OK;notify-by-email;PING OK - Packet loss = 0%, RTA = 0.82 ms

From reading more intently in the documentation, there is text about the following:

[quote]Monitoring Services on Down or Unreachable Hosts

The main purpose of Nagios is to monitor services that run on or are provided by physical hosts or devices on your network. It should be obvious that if a host or device on your network goes down, all services that it offers will also go down with it. Similarly, if a host becomes unreachable, Nagios will not be able to monitor the services associated with that host.

Nagios recognizes this fact and attempts to check for such a scenario when there are problems with a service. Whenever a service check results in a non-OK status level, Nagios will attempt to check and see if the host that the service is running on is “alive”. Typically this is done by pinging the host and seeing if any response is received. If the host check command returns a non-OK state, Nagios assumes that there is a problem with the host. In this situation Nagios will “silence” all potential alerts for services running on the host and just notify the appropriate contacts that the host is down or unreachable. If the host check command returns an OK state, Nagios will recognize that the host is alive and will send out an alert for the service that is misbehaving.
[/quote]

With my host going into a DOWN state, the text above says that I shouldn’t be getting notified for any services under that host, and this appears to be working. What I don’t understand is that the documentation describes notifying the “appropriate contacts that the host is down.” How does it know what the appropriate contacts are for the host? Does it just assume that the appropriate contacts for the services running on that host are the same as for the host itself?


#10

I seem to be getting more hits on the UK forum. Plus, there’s no sense in me and others typing everything twice, so I’m putting a link here to the other forum.

http://www.nagios-forum.co.uk/forum/viewtopic.php?p=138#138


#11

You have event_handler_enabled 1. Please turn that off and try it again.