"Service dependencies" feature doesn't work as expected!

czorzella · June 18, 2009, 9:13pm

We are having trouble getting the “Service dependencies” feature to work as described in the Nagios user’s manual.

Our Scenario:
To avoid confusion, let’s consider an environment with an absolutely fresh install of Nagios and a single monitored host. Let’s say that host has two monitored services, PING and disk_space. If we make the disk_space service dependent on the PING service, the results vary depending on the order in which the services are scheduled to be checked in the queue.

For example, suppose the PING service is down. If it happens to be scheduled for checking before disk_space, the service dependencies feature works as expected every time. If instead PING happens to be scheduled after disk_space, the service dependency logic is completely ignored and both services fail every time.

Our Goal:
Ensure Nagios behaves exactly as stated in its user manual. Please refer to the section “How Service Dependencies Are Tested” at nagios.sourceforge.net/docs/2_0/ … ncies.html

A partial transcript of that document follows:

Before Nagios executes a service check or sends notifications out for a service, it will check to see if the service has any dependencies. If it doesn’t have any dependencies, the check is executed or the notification is sent out as it normally would be. If the service does have one or more dependencies, Nagios will check each dependency entry as follows:

Nagios gets the current status* of the service that is being depended upon.
Nagios compares the current status of the service that is being depended upon against either the execution or notification failure options in the dependency definition (whichever one is relevant at the time).
If the current status of the service that is being depended upon matches one of the failure options, the dependency is said to have failed and Nagios will break out of the dependency check loop.
If the current state of the service that is being depended upon does not match any of the failure options for the dependency entry, the dependency is said to have passed and Nagios will go on and check the next dependency entry.

This cycle continues until either all dependencies for the service have been checked or until one dependency check fails.

*One important thing to note is that by default, Nagios will use the most current hard state of the service(s) that is/are being depended upon when it does the dependency checks. If you want Nagios to use the most current state of the services (regardless of whether it’s a soft or hard state), enable the soft_service_dependencies option.
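The loop described in the manual excerpt above can be sketched roughly as follows. This is only an illustration of the documented logic, not Nagios source code: the names ServiceState, Dependency and dependencies_pass are made up for this example, and states are abbreviated with the same letters used in the failure criteria ("o", "w", "u", "c").

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    hard_state: str     # last hard state: "o", "w", "u" or "c"
    current_state: str  # most recent state, soft or hard


@dataclass
class Dependency:
    master: str           # service that is being depended upon
    failure_options: set  # e.g. {"w", "u", "c"}


def dependencies_pass(deps, states, soft_service_dependencies=False):
    """Return True if every dependency passes, False at the first failure."""
    for dep in deps:
        state = states[dep.master]
        # By default only the last hard state of the master is considered.
        status = state.current_state if soft_service_dependencies else state.hard_state
        if status in dep.failure_options:
            return False  # dependency failed: break out of the loop
    return True


# Hypothetical situation: PING has just gone critical but is still in a soft state.
states = {"PING": ServiceState(hard_state="o", current_state="c")}
deps = [Dependency(master="PING", failure_options={"w", "u", "c"})]

print(dependencies_pass(deps, states))
print(dependencies_pass(deps, states, soft_service_dependencies=True))
```

Note that in both modes the decision is based on the state already on record for the master service; nothing in this loop re-checks the master.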

***** HOSTS ************************************

define host {
	name				host-default
	notifications_enabled				1
	event_handler_enabled				1
	flap_detection_enabled				1
	process_perf_data				1
	retain_status_information				1
	retain_nonstatus_information				1
	register				0
}

define host {
	use			host-default
	host_name		HOST_A
	address			XXX.XXX.XX.X
	max_check_attempts	10
	notification_interval	120
	notification_period	24x7
	notification_options	d,u,r
	contact_groups		ADM
}


***** SERVICES *************************************

define service {
	name				service-default
	active_checks_enabled				1
	passive_checks_enabled				1
	parallelize_check				1
	obsess_over_service				1
	check_freshness				0
	notifications_enabled				1
	event_handler_enabled				1
	flap_detection_enabled				1
	process_perf_data				0
	retain_status_information				1
	retain_nonstatus_information				1
	register				0
}

define service {
	use			service-default
	host_name		HOST_A
	service_description	PING
	check_period		24x7
	max_check_attempts	3
	normal_check_interval	1
	retry_check_interval	1
	contact_groups		ADM
	notification_interval	120
	notification_period	24x7
	notification_options	w,c,r
	check_command		check_ping!900.0,50%!1200.0,80%
}

define service {
	use			service-default
	host_name		HOST_A
	service_description	DISK_C
	check_period		24x7
	max_check_attempts	3
	normal_check_interval	3
	retry_check_interval	1
	contact_groups		ADM
	notification_interval	120
	notification_period	24x7
	notification_options	w,c,r
	check_command		check_disk!C!80%!95%
}

***** DEPENDENCIES *************************************

define servicedependency{
	host_name HOST_A
	service_description PING
	dependent_host_name HOST_A
	dependent_service_description DISK_C
	execution_failure_criteria w,u,c
	notification_failure_criteria w,u,c
}

g1_ · June 19, 2009, 7:46am

This is what I have:


define service{
          use                 generic-service
          host_name           xxx
          service_description revt
          check_command       check_revt
          check_interval      5
          max_check_attempts  1
          }

define service{
          use                 generic-service
          host_name           xxx
          service_description load
          check_command       check_by_rssh!load
          check_interval      30
          max_check_attempts  3
          retry_interval      3
          }

define servicedependency{
        host_name                       xxx
        service_description             revt
        dependent_host_name             xxx
        dependent_service_description   load
        execution_failure_criteria   c,u
        notification_failure_criteria   c,u
        }

So in my example revt plays the same role as ping and cpu load is equivalent to disk space.

The trick is to set check_interval, max_check_attempts and retry_interval for both services in a way that guarantees that the ping results are up to date before the disk check reaches a hard critical state.

In my example I check revt every 5 minutes and it only requires one failure to become hard critical.
The load check runs every 30 minutes and requires 3 failures to become hard critical. When the load check fails the first time, it schedules the next check for 3 minutes later (retry_interval). Hence, from the first failure it has to fail 2 more times, which takes 2 x 3 = 6 minutes. That is longer than the 5-minute check_interval for revt, so we know that revt is up to date before the load check has failed 3 times.
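The arithmetic above can be written down as a small check. This is a minimal sketch using the values from my config; the function name minutes_to_hard_state is made up for this example, and it assumes the retries run exactly retry_interval minutes apart.

```python
def minutes_to_hard_state(max_check_attempts, retry_interval):
    """Minutes between the first failed check and the hard state.

    The first failure is attempt 1; the remaining attempts are
    each scheduled retry_interval minutes after the previous one.
    """
    return (max_check_attempts - 1) * retry_interval


revt_check_interval = 5  # revt is checked every 5 minutes
load_delay = minutes_to_hard_state(max_check_attempts=3, retry_interval=3)

print(load_delay)                        # 6
print(load_delay > revt_check_interval)  # True: revt is re-checked in time
```

The same check can be used for other value combinations: the dependent service should need longer to reach a hard state than the interval between checks of the service it depends on.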

Not sure if that helps…

czorzella · June 19, 2009, 12:01pm

Dear G1,

Thank you for your input.

Let me demonstrate why the model you’ve described is very likely to fail:

In the timetable below, please consider that:

A is equivalent to (PING) or (REVT)
B is equivalent to (disk space) or (cpu load)
B is dependent on A

00    05    10    15    20    25    30    35    40    45    50    55    60  (minutes)
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
A-----A-----A-----A-----A-----A-----A-----A-----A-----A-----A-----A-----A
B-----------------------------------B-----------------------------------B

If A fails between minutes 26 and 30 (or likewise between minutes 56 and 60), then B will ignore the fact that A is actually down, and B will fail as well.

You can reproduce this odd behavior at any time, and even predict the results, by simulating a failure between minutes 26 and 30 (or between minutes 56 and 60). Just keep an eye on the scheduling queue: if A is scheduled to be checked before B, the dependency works as expected; otherwise, it doesn’t.
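The window described above can be illustrated with a small simulation. This is only a sketch of the timetable, not of Nagios internals: the function name and parameters are made up for this example, and it assumes A becomes hard critical after a single failed check (max_check_attempts 1, as in the revt example) and that B is checked at a minute greater than 0.

```python
def state_of_a_seen_by_b(fail_at, b_check_at, a_interval=5, a_checked_first=False):
    """Return the state of A that is on record at the moment B is checked.

    fail_at:         minute at which A really goes down
    b_check_at:      minute at which B is checked (assumed > 0)
    a_checked_first: whether A's check runs before B's check when
                     both are due in the same minute
    """
    # Minutes at which A has already been checked when B's check starts.
    a_checks = [
        t for t in range(0, b_check_at + 1, a_interval)
        if t < b_check_at or a_checked_first
    ]
    last_a_check = a_checks[-1]
    # A is only recorded as down if it was checked after it failed.
    return "DOWN" if last_a_check >= fail_at else "OK"


# A goes down at minute 27; B is checked at minute 30.
print(state_of_a_seen_by_b(fail_at=27, b_check_at=30))                        # OK (stale)
print(state_of_a_seen_by_b(fail_at=27, b_check_at=30, a_checked_first=True))  # DOWN
# A goes down at minute 22 and is caught by its check at minute 25.
print(state_of_a_seen_by_b(fail_at=22, b_check_at=30))                        # DOWN
```

In the first case the recorded state of A is still OK, so the dependency passes and B is executed and fails, even though A is already down.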

The problem I see is that Nagios “trusts” the last reading of A and considers it good enough when it evaluates the dependency logic. My impression is that we have to find a way to “force” Nagios to update the status of A whenever the dependency logic is applied.

Please let me know your thoughts about it.

rojwilco · July 31, 2009, 2:36pm

Don’t the predictive dependency checks solve this problem?

nagios.sourceforge.net/docs/3_0/ … hecks.html

As I understand it, whenever an object goes into a state where it triggers a notification, Nagios will check whether it depends on anything else before going ahead with the notification.

I’m still having a problem, though, where it looks like this doesn’t work right. I have 5 hosts monitored at a remote site, and I have the default gateway for that site monitored as well. I set up the dependencies as follows:

define hostdependency {
        host_name               remote-gateway
        dependent_host_name     host1
        notification_failure_criteria   d,u
}

…for each dependent host. However, whenever the remote-gateway goes down and needs to send a host down notification, I get a HOST_DOWN notification for the gateway AND an UNKNOWN notification for each dependent host. Have I set this up incorrectly?

luca · August 3, 2009, 2:16pm

If a host has a parent relation (which you probably have) and the parent host goes down, the dependent host goes into an unknown state.
You may want to remove notifications for the UNKNOWN status.