Critical alerts on socket timeouts


#1

Hi,
I configured my Nagios server to send email only when services are in a critical state, because we want to receive email only when there are REAL problems. We monitor a lot of servers, and sometimes people change things or stop the firewall. When that happens, many services report a socket timeout, go critical, and we receive email about something that is not a real problem.

So, in short: is there a way I can change the…


#2

Every Nagios plugin I know of exits with code 2 if it times out. Nagios reads the exit code of every check it runs and marks the service CRITICAL when the exit code is 2.
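For reference, the exit-code convention Nagios relies on is 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN. A minimal Perl sketch of that mapping; the hash and sub names are made up for illustration, and codes outside 0-3 are simply reported as out of range here rather than guessing what Nagios does with them:

```perl
use strict;
use warnings;

# Standard plugin exit codes and the service states they map to.
my %STATE_FOR_CODE = (
    0 => 'OK',
    1 => 'WARNING',
    2 => 'CRITICAL',
    3 => 'UNKNOWN',
);

# Translate a plugin's exit code into the state name (hypothetical helper).
sub state_for_exit_code {
    my ($code) = @_;
    return $STATE_FOR_CODE{$code} // "out of range ($code)";
}

print state_for_exit_code(2), "\n";   # prints "CRITICAL"
```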

You will either have to recompile or rewrite every plugin you use so that it doesn't exit with code 2 on a timeout, or write some sort of master (wrapper) script that you use for all of your service checks: it executes whatever command is passed to it, looks for the timeout message in the plugin's output, and exits with 0 if "socket timeout" (or whatever your plugins print) is found.
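The core decision such a wrapper makes can be isolated in one small function. This is a sketch under assumptions: `wrapped_exit_code` is a hypothetical name, and the `/socket timeout/i` pattern should be adjusted to whatever your plugins actually print:

```perl
use strict;
use warnings;

# Decide which exit code the wrapper hands back to Nagios:
# report OK (0) when the plugin's output mentions a socket timeout,
# otherwise pass the plugin's own exit code through unchanged.
sub wrapped_exit_code {
    my ($output, $plugin_exit_code) = @_;
    return 0 if $output =~ /socket timeout/i;   # assumed message pattern
    return $plugin_exit_code;
}

print wrapped_exit_code("CRITICAL - Socket timeout after 10 seconds", 2), "\n";  # prints 0
print wrapped_exit_code("CRITICAL - Connection refused", 2), "\n";               # prints 2
```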

If a socket timeout isn't a "real problem", then I'd question whether it's worth monitoring. If you'd like Nagios to keep retrying for longer before switching to a HARD CRITICAL state and sending out an alert, adjust the retry_check_interval and max_check_attempts options in your service definitions. That should give people enough time to restart the firewall or whatever.
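As a rough rule of thumb (an approximation that ignores scheduling latency), a service that starts failing reaches the HARD state after max_check_attempts - 1 retries, each retry_check_interval apart, where one interval unit is interval_length seconds (60 by default in nagios.cfg). A small sketch of that arithmetic, with a hypothetical helper name:

```perl
use strict;
use warnings;

# Approximate delay between the first failed check (SOFT, attempt 1) and
# the HARD state that triggers a notification: one retry per remaining attempt.
# $interval_length is seconds per interval unit (defaults to 60 here).
sub seconds_until_hard {
    my ($max_check_attempts, $retry_check_interval, $interval_length) = @_;
    $interval_length //= 60;
    return ($max_check_attempts - 1) * $retry_check_interval * $interval_length;
}

# e.g. 8 attempts, retried every 2 minutes:
printf "%d minutes\n", seconds_until_hard(8, 2) / 60;   # prints "14 minutes"
```

If outages last longer than that window, the alert still goes out, so raising the values only helps for short interruptions.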


#3

I see. I set retry_check_interval to a longer time and increased the number of attempts to 8, but we are still getting a lot of emails. Isn't there an easier way than recompiling the plugins to make Nagios not treat socket timeouts as critical?


#4

As MP said:

That's a good way to override the critical alerts you don't need (and any other alerts as well).
I did it on our system, and it works fine


#5

How do I make all my service checks execute a script? And could you send me that script by email or post it here? Thanks.


#6

I've lost my old script (because we solved our socket timeout errors), but it could look like this.
It's pretty simple: the script takes all the necessary arguments, launches the check, and then matches a pattern against the check's output.

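[code]#!/usr/bin/perl
use strict;
use warnings;

# Usage: <this script> <plugin> [plugin arguments...]
my $commande = shift @ARGV or do { print "UNKNOWN - no plugin given\n"; exit 3; };
my $libexec_dir = "$ENV{NAGIOS_HOME}/libexec"; # we use a NAGIOS_HOME variable, which is really useful
my $command = "$libexec_dir/$commande";

# Run the plugin with all remaining arguments (list form: no shell quoting issues)
# and capture everything it prints.
open(my $fh, '-|', $command, @ARGV) or do { print "UNKNOWN - cannot run $command: $!\n"; exit 3; };
my $result = do { local $/; <$fh> } // '';
close($fh);                 # waits for the plugin and sets $?
my $exit_code = $? >> 8;    # the plugin's own exit code

print $result;
# Not sure about the exact wording of "socket timeout" (e.g. check_tcp prints "Socket timeout",
# hence /i); anyway, you can put whatever you like in here. 1 = WARNING, so no CRITICAL email.
if ( $result =~ /socket timeout/i ) { exit 1; }
exit $exit_code;
[/code]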

By the way, it's in Perl, which may not be the best choice, as this script will be launched very often. I'd suggest doing the same in C or whatever you like.
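To check the wrapper's mechanics without a real plugin, here is a minimal sketch that runs a child process the same way (list-form `open` with `'-|'`), using the running Perl interpreter (`$^X`) as a stand-in plugin that prints a timeout message and exits with 2. `run_plugin` is a hypothetical helper, not part of Nagios:

```perl
use strict;
use warnings;

# Run a command (list form, no shell) and return its output and exit code.
sub run_plugin {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    my $output = do { local $/; <$fh> } // '';
    close($fh);                  # waits for the child and sets $?
    return ($output, $? >> 8);   # the high byte of $? is the exit code
}

# Stand-in "plugin": perl itself, printing a timeout message and exiting with 2.
my ($out, $code) = run_plugin($^X, '-e',
    'print "CRITICAL - Socket timeout after 10 seconds\n"; exit 2');
print "plugin said: $out";
print "exit code: $code\n";
```

It should print the stand-in's message followed by `exit code: 2`, which is the value the wrapper then compares against.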


#7

Thanks a lot, that worked nicely!