Host down not alerted

bluethundr · August 11, 2009, 3:18pm

Hello,

I have a new Nagios 3 configuration. I am running Debian Lenny and installed Nagios through apt-get.

Last night I tried to simulate an outage by taking down the NIC on the backup server.

I got notifications that SSH and Ping were down, but Current Load and Disk Space kept reporting OK, and when I clicked on the host detail the status was still Host Up. It also took about 5 minutes for SSH/Ping to report down, and I need the notifications to be sent nearly instantaneously. How do I get the notifications sent more quickly? And what do I have to do to get Nagios to realize that a host is DOWN?

Here is the config for the host that I brought down:

define host{
        use                             generic-host
        host_name                       backup
        alias                           Backup Server
        address                         192.168.1.14
        check_command                   check-host-alive
        check_command                   check_ping!200.0,20%!600.0,60% 
        check_command                   check_all_disks!20%!10%
        check_command                   check_procs!250!400
        parents                         dlink-dgs1224t,gateway
        }

Here is the check_all_disks service definition (I put it in the hosts.cfg file where I keep my host definitions).

# Define a service to check the disk space of the root partition
# on the local machine.  Warning if < 20% free, critical if
# < 10% free space on partition.

define service{
        use                             generic-service         ; Name of service template to use
        hostgroup                       debian-servers
        service_description             Disk Space
        check_command                   check_all_disks!20%!10%
        }

luca · August 11, 2009, 7:47pm

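Disk space is a local check: check_all_disks runs check_disk on the Nagios server itself, so it reports the Nagios machine's own disks, not the backup server's. That is why it kept saying OK after the backup server dropped off the network (most likely the same goes for Current Load).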
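To check the backup server's own disks, the check has to run on that machine, for example through NRPE. The following is a minimal sketch, assuming the NRPE daemon is installed on the backup server and the check_nrpe plugin is at the Debian path /usr/lib/nagios/plugins/check_nrpe on the Nagios server. The names check_nrpe_remote, check_all_disks (on the NRPE side) and "Disk Space via NRPE" are hypothetical; the service description differs from "Disk Space" so it does not collide with the existing hostgroup-wide service.

# Hypothetical command: ask the NRPE daemon on $HOSTADDRESS$ to run
# the NRPE command named in $ARG1$.
define command{
        command_name                    check_nrpe_remote
        command_line                    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

# Disk space of the backup server itself. Assumes the backup server's
# nrpe.cfg contains a matching (hypothetical) entry such as:
#   command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -e
define service{
        use                             generic-service
        host_name                       backup
        service_description             Disk Space via NRPE
        check_command                   check_nrpe_remote!check_all_disks
        }

If the backup server is unreachable, this service goes CRITICAL (check_nrpe cannot connect) instead of silently reporting the Nagios server's disks as OK.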

If you want almost immediate notifications, Nagios will be running many more checks and the server will be under more load; it really depends on how many servers you want to check and how often. The standard Nagios time unit is one minute (see interval_length in nagios.cfg), so it will check once a minute at best. If a check fails, at least one or two retries should be made before alerting, which adds up to 2-3 minutes.

Most configs I've seen use a 5-10 minute interval between checks, a 1-2 minute retry interval, and 2-3 retries before notifying.
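As a sketch of how those numbers map onto Nagios 3 directives, here is a service template using a 5-minute check interval, a 1-minute retry interval and 3 attempts. The template name quick-notify-service is hypothetical; it assumes the stock generic-service template exists, as in the Debian sample configs. The same three directives also exist for host definitions.

# Hypothetical template; all times are in units of interval_length
# (60 seconds by default in nagios.cfg).
define service{
        name                            quick-notify-service    ; hypothetical template name
        use                             generic-service
        check_interval                  5       ; normal check every 5 minutes
        retry_interval                  1       ; re-check every minute while in a soft problem state
        max_check_attempts              3       ; hard state (and first notification) on the 3rd failed check
        register                        0       ; template only, not a real service
        }

# Rough worst case from failure to first notification, ignoring
# check run time and scheduling jitter:
#   check_interval + (max_check_attempts - 1) * retry_interval
#   = 5 + (3 - 1) * 1 = 7 minutes

Services would then use quick-notify-service instead of generic-service. Lowering check_interval shortens the worst case but increases the load on the Nagios server, as described above.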

A more immediate option is to have Nagios receive SNMP traps from a switch when a port goes down. That can really be almost immediate, but it takes a lot of configuration work.
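One way to wire that up is a passive-only service that an external trap handler feeds through the Nagios command file. This is a sketch under several assumptions: a trap receiver (for example snmptrapd with your own handler script) is already running, the switch is the dlink-dgs1224t host from the config in this thread, "Port Link Traps" is a hypothetical service name, and a check_dummy command wrapping the standard check_dummy plugin exists (define one if your install lacks it).

# Passive-only service: Nagios never polls it, it only accepts
# results submitted by the (assumed) trap handler.
define service{
        use                             generic-service
        host_name                       dlink-dgs1224t
        service_description             Port Link Traps         ; hypothetical name
        active_checks_enabled           0
        passive_checks_enabled          1
        max_check_attempts              1       ; go to a hard state on the first trap
        is_volatile                     1       ; notify on every non-OK result received
        check_command                   check_dummy!0   ; assumed command; not run while active checks are off
        }

# The handler reports a trap by writing one external command line to
# the file named by command_file in nagios.cfg, in this form
# (timestamp in seconds since the epoch, return code 2 = CRITICAL):
#   [<timestamp>] PROCESS_SERVICE_CHECK_RESULT;dlink-dgs1224t;Port Link Traps;2;<trap text>

Because the result arrives as soon as the switch sends the trap, no polling interval or retry delay is involved.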

PS: If you really want to tweak your Nagios config, I'd recommend starting by installing from source. It requires a bit of knowledge of how Nagios works, and it will make it easier to understand what's happening while you tweak things.

bluethundr · August 12, 2009, 12:51pm

Thanks! I guess a 5-minute notification delay is not bad. I can live with that.

But how do I get Nagios to admit that the host itself is DOWN, as opposed to just SSH/PING being down?

thanks!

luca · August 12, 2009, 4:29pm

I just saw your config:

define host{
        use                             generic-host
        host_name                       backup
        alias                           Backup Server
        address                         192.168.1.14
        check_command                   check-host-alive
        check_command                   check_ping!200.0,20%!600.0,60%
        check_command                   check_all_disks!20%!10%
        check_command                   check_procs!250!400
        parents                         dlink-dgs1224t,gateway
        }

check_command is the command executed to check whether the host is up or down, and a host uses only one. To check disks and whatever else, you have to define services. Check the example configs and go on from there.
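A sketch of the host definition with a single host check, reusing the names and address from the config above. With several check_command lines only one ends up being used, and since check_all_disks and check_procs run on the Nagios server itself, the extra lines are probably why the host kept showing Up. This assumes the stock check-host-alive command and generic-host template from the Debian sample configs.

# One host check only: is the machine reachable at all?
define host{
        use                             generic-host
        host_name                       backup
        alias                           Backup Server
        address                         192.168.1.14
        check_command                   check-host-alive
        parents                         dlink-dgs1224t,gateway
        }

# The ping thresholds, disk and process checks move into separate
# define service{ ... } blocks with host_name backup, like the
# Disk Space service earlier in this thread.

When check-host-alive fails, the host goes DOWN (or UNREACHABLE if a parent such as gateway is also down), and Nagios sends a host notification instead of only per-service alerts.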