I am having a problem where some hosts get “stuck” in the service/host check logic. Let me begin with my configuration:
I have around 2000 monitored hosts on my Nagios 3.2.0 installation, with the large installation options enabled. It runs on a CentOS 5 virtual machine under Xen.
The “ping” service uses check_icmp and must fail 3 times (retry interval = 1 min) before a host check is scheduled for the host. The host check also uses check_icmp and must fail 3 times (retry interval = 0) before a notification is sent out. Notifications are disabled for the ping service but enabled for host checks. The full definitions are at the bottom.
For random hosts that go down, the “ping” service check fails 3/3, but no host check is scheduled, so the host state stays “up” and no notifications are sent out for the host. If I manually re-schedule and force a host check for a problem host, it gets “unstuck” and all is well for that host.
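For reference, the manual workaround can also be scripted through the external command file. This is only a minimal sketch: SCHEDULE_FORCED_HOST_CHECK is a standard Nagios external command, but the command file path below is an assumption (the default for a source install; the real value is the command_file setting in nagios.cfg), and the function names are made up for illustration.

```python
import sys
import time

# Assumption: default source-install location. Check command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def forced_host_check_command(host_name, now=None):
    """Build one SCHEDULE_FORCED_HOST_CHECK line asking for an immediate check."""
    ts = int(time.time() if now is None else now)
    # Format: [timestamp] SCHEDULE_FORCED_HOST_CHECK;<host_name>;<check_time>
    return "[{0}] SCHEDULE_FORCED_HOST_CHECK;{1};{0}\n".format(ts, host_name)


def submit(command_line, command_file=COMMAND_FILE):
    """Write the command to the Nagios command pipe (Nagios must be running)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)


if __name__ == "__main__":
    # Usage: python force_host_check.py <host_name> [<host_name> ...]
    for name in sys.argv[1:]:
        submit(forced_host_check_command(name))
```

This only automates the workaround. It does not explain why the on-demand host check is not triggered in the first place.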
The logic doesn’t seem to continue from failing all 3 service checks to initiating an on-demand host check. I have heard that there are scheduling problems with 3.2.0, but I haven’t heard of this specific problem.
This happens only for a few random hosts here and there; most hosts that go down are handled fine. I cannot find a pattern, and the log file mentions nothing out of the ordinary.
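Since the log shows nothing, one way to spot the affected hosts is to scan the status file for hosts whose “ping” service is in a hard CRITICAL state while the host itself is still recorded as UP. The following is a rough sketch, assuming the usual Nagios 3 status.dat layout (hoststatus/servicestatus blocks of key=value lines) and the default status file path; both the path and the function names are assumptions for illustration.

```python
import sys

# Assumption: default source-install location. Check status_file in nagios.cfg.
STATUS_FILE = "/usr/local/nagios/var/status.dat"


def parse_status_blocks(text):
    """Yield (block_type, fields) for every 'type {' ... '}' block."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            if line.endswith("{"):
                block_type, fields = line[:-1].strip(), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


def find_stuck_hosts(text, service="ping"):
    """Return hosts that are UP although the service is hard CRITICAL."""
    host_state = {}
    hard_critical = set()
    for block_type, f in parse_status_blocks(text):
        if block_type == "hoststatus":
            host_state[f.get("host_name")] = f.get("current_state")
        elif (block_type == "servicestatus"
              and f.get("service_description") == service):
            # current_state 2 = CRITICAL, state_type 1 = HARD
            if f.get("current_state") == "2" and f.get("state_type") == "1":
                hard_critical.add(f.get("host_name"))
    # Host current_state 0 = UP
    return sorted(h for h in hard_critical if host_state.get(h) == "0")


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else STATUS_FILE
    with open(path) as fh:
        for host in find_stuck_hosts(fh.read()):
            print(host)
```

A host can legitimately show up here for a short moment between the third failed service check and the result of the on-demand host check, so only hosts that stay in the list across several runs are really “stuck”.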
These configurations have been in use for a long time, since 2.x. The problem appeared after upgrading from 3.0.6 to 3.2.0.
Please let me know if anything is not clear.
define service {
    service_description ping
    notifications_enabled 0
    is_volatile 0
    max_check_attempts 3
    check_interval 1
    retry_interval 1
    active_checks_enabled 1
    passive_checks_enabled 0
    check_period 24x7
    obsess_over_service 0
    check_freshness 0
    event_handler_enabled 0
    flap_detection_enabled 1
    flap_detection_options o,w,c
    process_perf_data 0
    retain_status_information 1
    retain_nonstatus_information 1
    notification_interval 1440
    notification_period 24x7
    notification_options w,r,c,f
    register 1
}
define host {
    max_check_attempts 3
    check_interval 0
    retry_interval 0
    active_checks_enabled 1
    passive_checks_enabled 1
    check_period workhours
    obsess_over_host 0
    check_freshness 0
    event_handler_enabled 0
    low_flap_threshold 0
    high_flap_threshold 0
    flap_detection_enabled 1
    flap_detection_options d
    process_perf_data 0
    retain_status_information 1
    retain_nonstatus_information 1
    notification_interval 1440
    first_notification_delay 0
    notification_options d,r,f
    register 1
}