Service alerts for a host that is down?

merickson · August 6, 2009, 5:52pm

We recently deployed Nagios 3.1.2 on our network.

Each time a host goes down, we receive alerts for each of the services running on the host, as well as for the host itself. What is the easiest way to have Nagios suppress notifications for individual services, if the entire host that the services run on is offline?

Do I have to explicitly tell Nagios that each service depends on the host that it runs on? My impression from the documentation was that this would be automatic, but maybe I’m missing something.

Thanks for any suggestions,

Miles

merickson · August 6, 2009, 6:06pm

Here’s the config file for one of the servers in question. As you can see, it’s pretty vanilla at this point:

[code]###############################################################################
# exchange.cfg
###############################################################################

###############################################################################
###############################################################################
#
# HOST DEFINITIONS
#
###############################################################################
###############################################################################

# Define a host for the Windows machine we’ll be monitoring

define host{
        use                     windows-server          ; Inherit default values from a template
        host_name               exchange                ; The name we’re giving to this host
        alias                   Exchange Mail Server    ; A longer name associated with the host
        address                 192.168.0.150           ; IP address of the host
        check_period            almost_24x7             ; Ignore scheduled restart between 1:30 and 2:00 a.m.
        }

###############################################################################
###############################################################################
#
# SERVICE DEFINITIONS
#
###############################################################################
###############################################################################

# Create a service for monitoring the version of NSClient++ that is installed

define service{
        use                     generic-service
        host_name               exchange
        service_description     NSClient++ Version
        check_command           check_nt!CLIENTVERSION
        check_period            almost_24x7
        }

# Create a service for monitoring the uptime of the server

define service{
        use                     generic-service
        host_name               exchange
        service_description     Uptime
        check_command           check_nt!UPTIME
        check_period            almost_24x7
        }

# Create a service for monitoring CPU load

define service{
        use                     generic-service
        host_name               exchange
        service_description     CPU Load
        check_command           check_nt!CPULOAD!-l 5,80,90
        check_period            almost_24x7
        }

# Create a service for monitoring memory usage

define service{
        use                     generic-service
        host_name               exchange
        service_description     Memory Usage
        check_command           check_nt!MEMUSE!-w 80 -c 90
        check_period            almost_24x7
        }

# Create a service for monitoring C:\ disk usage

define service{
        use                     generic-service
        host_name               exchange
        service_description     C:\ Drive Space
        check_command           check_nt!USEDDISKSPACE!-l c -w 80 -c 90
        check_period            almost_24x7
        }

# Create a service for monitoring D:\ disk usage

define service{
        use                     generic-service
        host_name               exchange
        service_description     D:\ Drive Space
        check_command           check_nt!USEDDISKSPACE!-l d -w 80 -c 90
        check_period            almost_24x7
        }

# Create a service for monitoring the W3SVC service

define service{
        use                     generic-service
        host_name               exchange
        service_description     W3SVC
        check_command           check_nt!SERVICESTATE!-d SHOWALL -l W3SVC
        check_period            almost_24x7
        }

# Create a service for monitoring the Explorer.exe process (currently disabled)

#define service{
#       use                     generic-service
#       host_name               exchange
#       service_description     Explorer
#       check_command           check_nt!PROCSTATE!-d SHOWALL -l Explorer.exe
#       check_period            almost_24x7
#       }
[/code]

luca · August 6, 2009, 7:32pm

Did you change something in nagios.cfg? The default behaviour should be not to send notifications for services if the host is down… Are you getting a host down notification too?

merickson · August 6, 2009, 8:58pm

Hi luca,

Yes, we receive a host down notification as well, but it doesn’t necessarily come before the service notifications.

e.g. recently when a server was rebooted, we saw this sequence of e-mail alerts:

** PROBLEM: servername CPU Load is CRITICAL **
** PROBLEM: servername NSClient++ Version is CRITICAL **
** PROBLEM: servername Uptime is CRITICAL **
** PROBLEM Host Alert: servername is DOWN **
** RECOVERY Host Alert: servername is UP **
** RECOVERY: servername CPU Load is OK **
** RECOVERY: servername NSClient++ Version is OK **
** RECOVERY: servername Uptime is OK **

Of course, each of those “critical” service alerts was really a “Socket timeout after 10 seconds” error because the host had already gone down. We did not receive alerts for drive space or memory usage, which indicates that Nagios did stop alerting for services after the host-down message went out.

I thought I had read in the documentation that Nagios would check the host status before sending a service alert, but it appears that this isn’t happening. Is there anything we can do? Does the maximum number of attempts for the host have to be smaller than for each service, in order for the host to go DOWN before any services go CRITICAL?

luca · August 7, 2009, 7:25am

[quote=“merickson”]
I thought I had read in the documentation that Nagios would check the host status before sending a service alert, but it appears that this isn’t happening. Is there anything we can do? Does the maximum number of attempts for the host have to be smaller than for each service, in order for the host to go DOWN before any services go CRITICAL?[/quote]

That’s how I remember it should work, and I don’t remember ever needing to make services dependent upon their own host.

Sorry, if you didn’t change some other option I don’t know what it could be. Maybe somebody else has already run into this.
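For reference, an explicit dependency in Nagios is declared between services rather than between a service and its host, so it would have to name a "master" service. The following is only a sketch and was not proposed by anyone in this thread: it assumes the NSClient++ Version check from exchange.cfg is used as the master, and the list of dependent services is an arbitrary selection. Dependencies are evaluated against the master's state, so this may be subject to the same timing issue described above.

[code]
# Hypothetical sketch, not from the thread: hold back notifications for some
# check_nt services on "exchange" while the NSClient++ Version check is failing.
define servicedependency{
        host_name                       exchange
        service_description             NSClient++ Version              ; assumed master service
        dependent_host_name             exchange
        dependent_service_description   CPU Load,Uptime,Memory Usage    ; arbitrary selection
        execution_failure_criteria      n       ; keep running the dependent checks
        notification_failure_criteria   c,u     ; no notifications while the master is CRITICAL or UNKNOWN
        }
[/code]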

merickson · August 7, 2009, 9:42pm

I have “sort of” resolved this by ensuring that hosts go “down” before services go “critical”. Basically, hosts and services are both set to be checked every 15 seconds, but hosts have max_check_attempts set to 2 and services have this value set to 4. This means that when a host goes down, it generates a host alert 30 seconds before the first service alert would go out, and this suppresses service notifications for hosts that are down.
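A minimal sketch of what that could look like, applied to the host and one service from exchange.cfg. Only the max_check_attempts lines reflect the change described; the 15-second check interval is assumed to be configured elsewhere (for example in the templates) and is not shown, and the same service value would need to go on every service or on the generic-service template.

[code]
# Sketch only: max_check_attempts values taken from the description above.
define host{
        use                     windows-server
        host_name               exchange
        alias                   Exchange Mail Server
        address                 192.168.0.150
        check_period            almost_24x7
        max_check_attempts      2       ; host goes hard DOWN after 2 failed checks
        }

define service{
        use                     generic-service
        host_name               exchange
        service_description     CPU Load
        check_command           check_nt!CPULOAD!-l 5,80,90
        check_period            almost_24x7
        max_check_attempts      4       ; service goes hard CRITICAL after 4 failed checks
        }
[/code]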

It’s a kludge, but it’s better than getting 14 e-mail alerts (one for the server and one for each service on that server) every time a server reboots.

If anyone has any advice about how to get Nagios to check the host automatically when a service goes down before it sends a service alert, that would still be better.

Cheers!

Miles