Forcing checks on dependent services


#1

Hi,

Hope someone can help me with this one.

I have a Nagios installation which monitors approximately 40 servers.

These servers all have between 1 and 32 service checks running on them.

The bulk of these service checks are implemented using SNMP (a corporate requirement), and occasionally the SNMP daemon on a server stops responding and needs restarting.

I have set up service dependencies for all these services on a service called ‘SNMP’ on each server, in the hope that if the daemon dies I get one alert telling me it’s dead instead of 20 alerts complaining about every service on the box.

Unfortunately, as far as I can see, service dependencies use the last known state of the SNMP service.

This would work if the SNMP service got checked before the others but that isn’t guaranteed.

I can therefore get notifications for the other services up until SNMP is checked at its scheduled interval, at which point the notifications stop until the SNMP daemon is restarted.
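
To make the timing problem concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Nagios code: the function name, the event list, and the service names other than SNMP and Mem Used are made up for illustration. It assumes a notification for a dependent service is suppressed only if the last known state of SNMP is already a failure.

```python
def notifications(events, master="SNMP"):
    """Return the (minute, service) pairs that would trigger an email.

    events is a list of (minute, service, state) check results in time order.
    A dependent service is suppressed only when the *last known* state of
    the master service is already non-OK (toy model, not Nagios itself).
    """
    last_state = {}
    sent = []
    for minute, service, state in events:
        last_state[service] = state
        if state == "OK":
            continue
        master_failed = last_state.get(master, "OK") != "OK"
        if service != master and master_failed:
            continue  # suppressed by the dependency
        sent.append((minute, service))
    return sent


if __name__ == "__main__":
    # Hypothetical schedule: the SNMP daemon dies just before minute 0,
    # but the SNMP service itself is not checked until minute 3.
    events = [
        (0, "Mem Used", "CRITICAL"),
        (1, "Disk Used", "CRITICAL"),
        (3, "SNMP", "CRITICAL"),
        (4, "Load", "CRITICAL"),
    ]
    print(notifications(events))
    # prints [(0, 'Mem Used'), (1, 'Disk Used'), (3, 'SNMP')]
```

The two services checked before SNMP still raise alerts; only the one checked afterwards is suppressed.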

What I want to know is whether there is some way to make Nagios check SNMP automatically when one of its dependent services fails.

Any and all ideas appreciated!

Troy Daniels.
Systems Administrator
iTouch Australia.


#2

If that is how it’s working for you, then you set it up wrong. In V1.2 at least, if I set up dependencies, I get ONE email.
If some check fails, it walks the dependency tree until it finds the top dog. The top dog is the only one you get an email for.

I suspect that you have your contact set up to get notifications for WARNING, UNKNOWN, or CRITICAL, when actually we don’t care about UNKNOWNs.

You have a check for “snmpd is running”, and that is going to fail hard; it will return “down”, so that is the only email you want.

Change your contact definition or maybe the service definitions.
notification_options w,u,c,r    (wrong)
notification_options w,c,r      (correct)

[Edited Mon Jul 18 2005, 09:00AM]


#3

That behaviour is what I was expecting too but it’s not what I got…

I get emails when a problem occurs, and I received emails for 2 services 3 minutes before SNMP was detected as down.

Once SNMP was detected as down I got no further emails until the SNMP recovery message came out.

Based on this behaviour it’s looking like SNMP isn’t being checked automatically when one of the other services goes down.

I’ve checked my dependency entries and they look right to me:

define servicedependency {
    host_name                          svr1
    service_description                SNMP
    dependent_host_name                svr1
    dependent_service_description      Mem Used
    notification_failure_criteria      w,u,c
    }
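
With up to 32 services per server, writing one of these blocks per dependent service by hand gets tedious. A minimal Python sketch that generates them could look like this. It keeps the same orientation as the definition above (SNMP as the service depended upon, which the documentation quoted later in this thread confirms), the function names are my own, and "Disk Used" is a made-up service name used only for illustration.

```python
def dependency_block(host, master, dependent, criteria="w,u,c"):
    """Build one servicedependency definition as text.

    master is the service being depended upon (SNMP in this thread);
    dependent is a service that needs the master in order to work.
    """
    fields = [
        ("host_name", host),
        ("service_description", master),
        ("dependent_host_name", host),
        ("dependent_service_description", dependent),
        ("notification_failure_criteria", criteria),
    ]
    lines = ["define servicedependency {"]
    lines += [f"    {name:<34} {value}" for name, value in fields]
    lines.append("    }")
    return "\n".join(lines)


def dependency_config(host, master, dependents, criteria="w,u,c"):
    """Build one block per dependent service, skipping the master itself."""
    blocks = [
        dependency_block(host, master, dep, criteria)
        for dep in dependents
        if dep != master
    ]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # "Disk Used" is a hypothetical service name.
    print(dependency_config("svr1", "SNMP", ["SNMP", "Mem Used", "Disk Used"]))
```

Redirect the output into a file that Nagios includes, then verify the configuration before restarting.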

define service{
    use                                generic-service    ; Name of service template to use
    host_name                          svr1,svr2
    service_description                SNMP
    is_volatile                        0
    check_period                       24x7
    max_check_attempts                 1
    normal_check_interval              5
    retry_check_interval               1
    contact_groups                     sysadmins
    notification_interval              120
    notification_period                24x7
    notification_options               w,u,c,r
    check_command                      check_snmp_up
    }

I’m running Nagios 2.0b3.

I checked the event log and there is no record of it performing a check on SNMP until shortly before the notification (i.e. 3 minutes after the problem was first detected).

I can see that Nagios is walking the dependency tree OK, but it’s only using the current state of the service, and not performing a recheck.

Troy.

[Edited Mon Jul 18 2005, 08:21PM]


#4

Common mistake, you have it backwards.
Please read this very carefully and study the .png picture.
nagios.sourceforge.net/docs/1_0/ … ncies.html
nagios.sourceforge.net/docs/1_0/ … encies.png

SNMP daemon has to be running before anything else is going to work. So, the
dependent_service_description snmp #but that’s not what you have.

Think of it like this: for every host/service definition that you have, is there some host/service that you need in order for that service check to work? If so, then define that service as the
dependent_service_description and dependent_host_name for that service check.


#5

Doh! I understand now!

I thought you defined the services that were dependent on the SNMP service, not the other way around.

Thanks heaps for this!


#6

Yeah, the wording is completely confusing, and you really can’t think about it for very long or it starts to hurt.

The best thing to do is to get it working, and then just repeat and flush without thinking about it. In other words, whatever works is what you stick with, and ignore the words chosen by the programmer.


#7

Just thought I’d post an update to this problem.

I made the dependency definition changes as suggested.

However, since then SNMP has been behaving itself, so the changes haven’t been tested. On the weekend this changed: SNMP fell over.

Unfortunately, instead of one ‘SNMP is down’ email I got 15 ‘every service except SNMP is down’ emails!

This prompted me to investigate again, and I reread the documentation. I discovered I’d had the definition the right way around the first time; however, Nagios only uses “the current state” of the service, as explained in the following passage:

“One important thing to note is that by default, Nagios will use the most current hard state of the service(s) that is/are being depended upon when it does the dependency checks. If you want Nagios to use the most current state of the services (regardless of whether it’s a soft or hard state), enable the soft_service_dependencies option.”

Reading this, it looks like there may be no way to force a check of SNMP if one of the other services fails. This might be a Nagios 2.0+ thing.

For now, I’m going to look at tweaking the service retry times and enabling the soft_service_dependencies option, and see how that works.

If someone knows a better way to force service check execution in this situation, I’m all ears!
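
One avenue that might be worth trying is the external command SCHEDULE_FORCED_SVC_CHECK, which asks Nagios to check a service at a given time regardless of its normal schedule. A dependent service's event handler could write it to the command file whenever that service goes non-OK. The Python below is only a sketch under several assumptions: external commands are enabled (check_external_commands=1), the command file lives at the path shown (set this to whatever command_file is in your nagios.cfg), and the handler receives the host name as its first argument. I have not verified how this behaves on 2.0b3, and whether the forced result arrives before the dependent service's notification goes out would need testing.

```python
import sys
import time

# Assumption: replace with the command_file setting from nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def forced_check_command(host, service, now=None):
    """Format a SCHEDULE_FORCED_SVC_CHECK line for the Nagios command file.

    Format: [timestamp] SCHEDULE_FORCED_SVC_CHECK;host;service;check_time
    """
    ts = int(time.time() if now is None else now)
    return f"[{ts}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{ts}\n"


def submit(command, path=COMMAND_FILE):
    """Write one command line to the command file (a named pipe)."""
    with open(path, "w") as pipe:
        pipe.write(command)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage as an event handler: force_snmp_check.py <host_name>
    submit(forced_check_command(sys.argv[1], "SNMP"))
```

If notifications still slip through, giving the dependent services a max_check_attempts greater than 1 should leave time for the forced SNMP result to come back before they reach a hard state.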

Troy Daniels
Systems Administrator
iTouch Australia.


#8

nagios.sourceforge.net/docs/1_0/ … pendencies

Glad you found that in the docs, I just read that too.