Multiple Services Integration

Hi there,
I have a question about how to monitor at least 5 services on different servers.

The idea is that, across these different services, if one fails I am not alerted about it individually; instead a single GENERAL notification is sent indicating that the sum of the services, call it SUM_OF_SERVICE, is not OK.

SUM_OF_SERVICE does not exist as a real service, so how could Nagios alert me about it?

Example:

Service tnsping
Service Pmon
Service WEB
Service LDAP
Service LOGIN

When one of the above services fails, I don’t want to receive notifications about the others… because those services together make up SERVICE X (any service failing leads to a SERVICE X failure).
I have tried with dependencies, but this does not solve the problem because SERVICE X does not exist; it is an “abstract service”…

Excuse my poor English, but I hope the general idea is understood.

Can you do this with Nagios?

Thanks

The easiest way to do that (if you are absolutely sure that one service failing results in all the others failing and in SUM_OF_SERVICE failing) is to disable notifications for all 5 services and create a new service called SUM_OF_SERVICE with notifications enabled. Since all of these services are on different servers, as you’ve said, you can add it to the Nagios server host.
With that done, enable event handlers for those 5 services and have each one use submit_check_result to submit a result to the SUM_OF_SERVICE service with the message “All 5 services are Critical” and the critical return code (2). Be sure to set is_volatile to 0 in the SUM_OF_SERVICE service definition, because you don’t want to get the same message 5 times for the service.

For example:

define service{
host_name Nagios_server
service_description SUM_OF_SERVICE
is_volatile 0
check_command any
max_check_attempts #
normal_check_interval #
retry_check_interval #
active_checks_enabled 0
passive_checks_enabled 1
check_period timeperiod_name
check_freshness 0
notification_interval #
notification_period timeperiod_name
notification_options w,u,c,r,f
notifications_enabled 1
contact_groups contact_groups
}

define command{
command_name submit_check_result
command_line /usr/local/nagios/libexec/eventhandlers/submit_check_result
}

define service{
host_name host_tnsping
service_description tnsping
servicegroups servicegroup_names
check_command command_name
max_check_attempts #
normal_check_interval #
retry_check_interval #
active_checks_enabled 1
passive_checks_enabled 0
check_period timeperiod_name
event_handler submit_check_result!Nagios_server!SUM_OF_SERVICE!2!"All 5 services are Critical"
event_handler_enabled 1
notifications_enabled 0
}

The last configuration should be applied to all of the other services as well (Pmon, WEB, LDAP, LOGIN).

I haven’t tested this; I’ve written it as an idea off the top of my head. So test it and see if it works.
I hope I understood what you want. There are other ways to accomplish this. For example, you could write a script that checks all 5 services and returns the Critical exit code if any of them fails. If you know how to write scripts or do some programming, you could try that and assign the script as the check_command for SUM_OF_SERVICE.
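
The script idea above could look roughly like the sketch below. It is only an illustration under some assumptions: check_sum_of_service is a made-up name, the commands "true" and "false" stand in for your real plugin calls, and any non-zero exit status from a check is treated as a failure.

```shell
#!/bin/sh
# check_sum_of_service: hypothetical aggregate check, not part of Nagios.
# Runs every command line passed as an argument and reports CRITICAL
# (return code 2) if any of them fails, otherwise OK (return code 0).
check_sum_of_service() {
    failed=""
    for check in "$@"; do
        # Each argument is a full command line, e.g. a plugin call.
        if ! sh -c "$check" >/dev/null 2>&1; then
            failed="$failed [$check]"
        fi
    done

    if [ -n "$failed" ]; then
        echo "SUM_OF_SERVICE CRITICAL - failed checks:$failed"
        return 2
    fi
    echo "SUM_OF_SERVICE OK - all $# checks passed"
    return 0
}

# Demo with stand-in commands; replace them with your real plugin calls.
check_sum_of_service "true" "true"
check_sum_of_service "true" "false" || echo "return code: $?"
```

With a script like this, SUM_OF_SERVICE would be an active service and the event handlers would not be needed, but the 5 checks would run a second time on every check of SUM_OF_SERVICE.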

Good idea Albin,
I see one obstacle in the definition of SUM_OF_SERVICE:
You mean this would be a PASSIVE SERVICE?
Is the check_command directive “mandatory”?
I think that’s the messy part of the “ethereal” or “abstract” service.

Anyway, I’m starting the tests.

Thank you.

Hm, I really don’t know whether it is necessary to fill in check_command. If Nagios cannot work without it, just put check_ping or something; it doesn’t matter, because it won’t be executed, since freshness and active checks are disabled.

And yes, it would be a passive service, because its state doesn’t matter until one of the real services fails.

Good luck :)

Mhhh, we are talking about a PASSIVE CHECK without NSCA, on the same machine??
What changes are necessary for submit_check_result to work? (This script is meant for use with NSCA.)

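submit_check_result should work without NSCA as a standalone script. Try to run

/usr/local/nagios/libexec/eventhandlers/submit_check_result Nagios_server SUM_OF_SERVICE 2 "All 5 services are Critical"

from the terminal and check nagios.log and the web GUI for that service. It should appear as Critical with the info “All 5 services are Critical”.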

Of course, when running that command you should use the actual path to your submit_check_result script, and both the Nagios_server host and the SUM_OF_SERVICE service for that host must be defined in the Nagios configuration.
A service can be configured as passive regardless of NSCA. For example, I have passive services configured that get their status and check output via submit_check_result only when an SNMP trap is received by the Nagios server.

This is my submit_check_result.
It doesn’t work with: /usr/local/nagios/libexec/eventhandlers/submit_check_result Nagios_server SUM_OF_SERVICE 2 "All 5 services are Critical"

[blockquote]#!/bin/sh

# Arguments:
#  $1 = host_name (Short name of host that the service is
#       associated with)
#  $2 = svc_description (Description of the service)
#  $3 = state_string (A string representing the status of
#       the given service - "OK", "WARNING", "CRITICAL"
#       or "UNKNOWN")
#  $4 = plugin_output (A text string that should be used
#       as the plugin output for the service checks)

# Convert the state string to the corresponding return code
return_code=-1

case "$3" in
    OK)
        return_code=0
        ;;
    WARNING)
        return_code=1
        ;;
    CRITICAL)
        return_code=2
        ;;
    UNKNOWN)
        return_code=-1
        ;;
esac

# pipe the service check info into the send_nsca program, which
# in turn transmits the data to the nsca daemon on the central
# monitoring server

/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /usr/local/nagios/bin/send_nsca 172.18.24.9 -c /usr/local/nagios/etc/send_nsca.cfg
[/blockquote]

Since it is oriented to NSCA, what changes must be made for it to work?

[code]#!/bin/sh

# SUBMIT_CHECK_RESULT
# Written by Ethan Galstad ([email protected])
# Last Modified: 02-18-2002
#
# This script will write a command to the Nagios command
# file to cause Nagios to process a passive service check
# result. Note: This script is intended to be run on the
# same host that is running Nagios. If you want to
# submit passive check results from a remote machine, look
# at using the nsca addon.
#
# Arguments:
#  $1 = host_name (Short name of host that the service is
#       associated with)
#  $2 = svc_description (Description of the service)
#  $3 = return_code (An integer that determines the state
#       of the service check, 0=OK, 1=WARNING, 2=CRITICAL,
#       3=UNKNOWN).
#  $4 = plugin_output (A text string that should be used
#       as the plugin output for the service check)

echocmd="/bin/echo"

CommandFile="/var/lib/nagios3/rw/nagios.cmd"

# get the current date/time in seconds since UNIX epoch
datetime=$(date +%s)

# create the command line to add to the command file
cmdline="[$datetime] PROCESS_SERVICE_CHECK_RESULT;$1;$2;$3;$4"

# append the command to the end of the command file
$echocmd "$cmdline" >> "$CommandFile"
[/code]

Here’s my submit_check_result. Save this in a new file, for example, submit_passive_service, and give it the same permissions as submit_check_result has.
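
Before pointing the script at the real command file, it can help to see the line it produces. The following is only a dry-run sketch: format_passive_result is a made-up helper name, and the line is appended to a temporary scratch file rather than to the Nagios command file.

```shell
#!/bin/sh
# Dry run: build the external command line and append it to a scratch
# file instead of the real Nagios command file.

# format_passive_result is a hypothetical helper name.
# $1 = host_name, $2 = svc_description, $3 = return_code, $4 = plugin_output
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}

scratch_file=$(mktemp)
format_passive_result "Nagios_server" "SUM_OF_SERVICE" 2 \
    "All 5 services are Critical" >> "$scratch_file"

# Show what would have been written, then clean up.
cat "$scratch_file"
rm -f "$scratch_file"
```

The printed line should start with a timestamp in square brackets, followed by PROCESS_SERVICE_CHECK_RESULT and the four values separated by semicolons.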

OK, that works…
The problem now is that SUM_OF_SERVICE is always critical, because the event handler for each service is:
event_handler submit_check_result!Nagios_server!SUM_OF_SERVICE!2!"All 5 services are Critical"

Ok.
As SUM_OF_SERVICE is a passive service, it doesn’t get any check result except from the event handler. If you want to reset SUM_OF_SERVICE to OK, you could add this to the SUM_OF_SERVICE service definition:

check_command reset_to_ok
check_freshness 1
freshness_threshold 60

This enables the freshness check and sets its threshold. In this example I’ve set it to 60, which means that if no passive check result is submitted within 60 seconds, Nagios will execute the command defined in check_command (in this case reset_to_ok), even though the service is passive. You can change the threshold to whatever you want.

And if you don’t want to be notified when the service is reset to OK, remove the r (recovery) option from notification_options in the SUM_OF_SERVICE service definition.

Define the reset_to_ok command:

define command{
command_name reset_to_ok
command_line /usr/local/nagios/libexec/eventhandlers/reset_to_ok.sh
}

And the reset_to_ok.sh script would be something like this:

#!/bin/bash
echo "No recent passive check result submitted" && exit 0

And of course set the permissions on reset_to_ok.sh so that Nagios can execute the script.

Before that… I have this from /var/log/messages:

Oct 22 13:33:09 svrsginagios nagios: SERVICE ALERT: oimpreprod;Check HTTP OIM;CRITICAL;SOFT;1;Conexión rehusada
Oct 22 13:33:09 svrsginagios nagios: SERVICE EVENT HANDLER: oimpreprod;Check HTTP OIM;CRITICAL;SOFT;1;submit_check_result!svrsginagios!NIVEL_SERVIC IO_OIM!2!"NIVEL DE SERVICIO DE OIM CRITICO"
Oct 22 13:33:09 svrsginagios nagios: EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;;;;
Oct 22 13:33:09 svrsginagios nagios: Warning: Passive check result was received for service ‘’ on host ‘’, but the host could not be found!
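
The line PROCESS_SERVICE_CHECK_RESULT;;;; suggests the script was called with four empty arguments. One possible cause, stated here as an assumption, is that the command_line of the submit_check_result command definition does not reference the $ARG1$ to $ARG4$ macros, so the values given after the ! separators never reach the script. A small guard at the top of the script would make this kind of failure visible; the sketch below uses a made-up function name, require_passive_args.

```shell
#!/bin/sh
# require_passive_args is a hypothetical helper: it refuses to go on
# when any of the four expected arguments is missing or empty.
require_passive_args() {
    if [ $# -lt 4 ] || [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ] || [ -z "$4" ]; then
        echo "usage: submit_check_result <host_name> <svc_description> <return_code> <plugin_output>" >&2
        return 3
    fi
    return 0
}

# With no arguments (what the log above suggests happened) the guard trips:
require_passive_args || echo "rejected, status $?"

# With all four arguments it passes:
require_passive_args "Nagios_server" "SUM_OF_SERVICE" 2 \
    "All 5 services are Critical" && echo "accepted"
```

In the real script the guard would be called as require_passive_args "$@" before the command line is built.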

Have you tried doing it the other way round, with the arguments in the command definition and the event handlers of all services without them? (I suppose the result always goes to the same service on the same server.) Like this:

[code]define command{
command_name submit_check_result
command_line /usr/local/nagios/libexec/eventhandlers/submit_check_result Nagios_server SUM_OF_SERVICE 2 "All 5 services are Critical"
}

define service{
host_name host_tnsping
service_description tnsping
servicegroups servicegroup_names
check_command command_name
max_check_attempts #
normal_check_interval #
retry_check_interval #
active_checks_enabled 1
passive_checks_enabled 0
check_period timeperiod_name
event_handler submit_check_result
event_handler_enabled 1
notifications_enabled 0
}[/code]

This should work. Check your syntax for typos. :)

Tests are working… now to test the reset to OK.
It’s difficult and messy, I know.

Only one thing about this: when the event handler is triggered, the result is always CRITICAL (even for an OK state).
I mean, SUM_OF_SERVICE is in a critical state, then the service RECOVERS, but the state change runs the submit_check_result event handler, which reports a CRITICAL state (when it is really OK).

Then, when I test reset_to_ok, I’ll have to adjust the timing to avoid “false positives”, I think.
Also, if the service really is critical and gets reset to OK, that’s a bad thing, you know?

Thanks Albin, I hope it works fine with the reset to OK.
I’ll let you know tomorrow.
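
On the event handler always reporting CRITICAL: the handler runs on every state change, including recoveries, and the return code 2 is hard-coded. One possible way around it, sketched here as an assumption rather than a tested setup, is to have the command definition pass the $SERVICESTATE$ macro to the script and translate it into a return code, much like the NSCA version of the script does. The function name state_to_return_code is made up for the example.

```shell
#!/bin/sh
# state_to_return_code is a hypothetical helper. It maps the state string
# Nagios provides in $SERVICESTATE$ to the numeric code expected by
# PROCESS_SERVICE_CHECK_RESULT (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
state_to_return_code() {
    case "$1" in
        OK)       echo 0 ;;
        WARNING)  echo 1 ;;
        CRITICAL) echo 2 ;;
        *)        echo 3 ;;
    esac
}

# Demo: a failure and the following recovery give different codes.
for state in CRITICAL OK; do
    echo "$state -> $(state_to_return_code "$state")"
done
```

Note that this only reflects the service whose state just changed. If one service recovers while another is still down, SUM_OF_SERVICE would be set to OK too early, so the freshness timing concern above still applies.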