Notification for exceeding uptime?

Goldie · February 13, 2008, 2:52pm

Hi Folks,

I’ve been using Nagios for some months now and I’m really happy with it.

Today I was wondering if it is possible to have Nagios send a notification e-mail when the uptime of a specific server exceeds, for example, 90 days…

I am working on a server reboot scheme, and it would be nice to receive a notification mail when it is about time to reboot a specific server, in case a reboot has not already happened earlier due to memory leaks or something.

Thx in advance!

Grtz,

Goldie

Strides · February 14, 2008, 11:02am

Hi

Are you using something like check_nt!UPTIME? That command does have -w (arn) and -c (ritical) switches, so it may be possible to use these to evaluate the uptime and generate e-mails based on the results. I don’t monitor any Windows servers myself so can’t test it for you, but a quick check from your Nagios command line should answer that… it’s probably something like
./check_nt -v UPTIME -H <ip_address> -w -c
As likely as not, though, this won’t work at all: the -w and -c switches will either be ignored or evaluated incorrectly, or check_nt will throw an error or something…

The other way to do it would be to write your own check. Inside it you call check_nt (or whatever check command you are currently using) to retrieve the uptime of your server, feed that into a routine that evaluates the response, and then output the result as an OK or Critical condition. It shouldn’t be too hard to evaluate check_nt’s output, which will be something like “System Uptime - 18 day(s) 2 hour(s) 56 minute(s)”: match against whatever sits between “System Uptime - ” and “ day”…

Your new script only needs to take your “warning” value as an input figure from Nagios, plus the $HOSTADDRESS$, which you can pass to the script from the definition of your new command in command.cfg (for use in check_nt’s -H option). I say it shouldn’t be too hard, but that assumes you have some experience writing this sort of thing… If you don’t, give it a shot anyway. It’ll be a good way to learn something like Perl.
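For what it's worth, a rough sketch of that matching step in Perl. The sub names `uptime_days` and `uptime_state` are made up for illustration, the thresholds are arbitrary, and the sample line is the one quoted above; a real wrapper would get the line from check_nt rather than from a hard-coded string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the day count out of a check_nt UPTIME line.
# Returns undef when the line does not look like the expected output.
sub uptime_days {
    my ($line) = @_;
    return $line =~ /System Uptime - (\d+) day/ ? $1 : undef;
}

# Hypothetical helper: map a day count onto a Nagios state name.
sub uptime_state {
    my ($days, $warn, $crit) = @_;
    return 'UNKNOWN'  unless defined $days;
    return 'CRITICAL' if $days > $crit;
    return 'WARNING'  if $days > $warn;
    return 'OK';
}

my $sample = 'System Uptime - 18 day(s) 2 hour(s) 56 minute(s)';
my $days   = uptime_days($sample);
printf "%s - Uptime: %s days\n", uptime_state($days, 90, 100), $days;
# prints "OK - Uptime: 18 days"
```

A Nagios plugin would then exit with 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN instead of just printing the state.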

HTH

/S

Goldie · February 14, 2008, 2:59pm

Strides,

Thx for the reply.
The -w and -c flags don’t work. I had already tried them before posting here.
I have no experience whatsoever with Perl. I gave it a shot today, but don’t have the time to keep at it. Maybe someone with some experience making new commands knows a solution.

Grtz

Strides · February 14, 2008, 3:59pm

Ah, OK. I might be able to alter a script I have to do something like what you’re after; it shouldn’t take long. What does the output of check_nt look like on one of your servers that’s been up around 90 days… is it like “System Uptime - 92 day(s) 2 ho…”, i.e. is it only counting in days rather than weeks/months etc?

Goldie · February 15, 2008, 8:07am

Hello There!

That is really kind.
The output Nagios gives with the check_nt command is like this:
System Uptime - 92 day(s) 20 hour(s) 21 minute(s)
Hope I can be of some help to you one day!

Grtz,
Goldie

Strides · February 15, 2008, 9:31am

Hi Goldie
Please see the attached Perl script “check_ntutil_wrapper”. Hopefully this will sort you out (at least, the limited simulations I can do without an actual monitored Windows server to test on seemed to go well…)
I would start off by popping it into your libexec directory (if libexec is not “/usr/local/nagios/libexec” then you will need to modify the script slightly at line 69 to reflect the correct location of check_nt).
Then please test it from the command line to make sure it works as expected, i.e. for your 92 day uptime server:
"perl check_ntutil_wrapper -w 95 -c 100 -H <ip_address>" should yield an “OK - Uptime:92 days”
"perl check_ntutil_wrapper -w 90 -c 95 -H <ip_address>" should yield an “Uptime warn threshold exceeded”
"perl check_ntutil_wrapper -w 85 -c 90 -H <ip_address>" should yield an “Uptime crit threshold exceeded”

If all goes as expected, then chmod/chown it appropriately and add it to command.cfg (shout if you don’t get how to do this) and following a service nagios restart, you should be good to go.

Hope it all goes swimmingly for you

Regards

/S

Strides · February 15, 2008, 9:33am

So attaching doesn’t work then… :-S Here it is in full:

#!/usr/bin/perl -w 
my $Version='0.1';
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# Licence : GPL - http://www.fsf.org/licenses/gpl.txt
# Developed by Strides in response to request from Goldie on the nagios forums
# http://www.meulie.net/portal_plugins/forum/forum_viewforum.php?21
# Acknowledgements due to Patrick Proy et al, upon whose top script
# 'check_snmp_load' this has been largely based
# Help : perl check_ntutil_wrapper -h
use strict;
use Getopt::Long;
my %ERRORS=('OK'=>0,'WARNING'=>1,'CRITICAL'=>2,'UNKNOWN'=>3,'DEPENDENT'=>4);
my $o_help=     undef;          # want some help ?
my $o_host =    undef;          # hostname
my $o_version=  undef;          # print version
my $o_warn=     undef;          # warning level
my $o_crit=     undef;          # critical level
my @o_xnti=     ();             # check_nt input (lines of output)
my $xntline=    undef;          # output line
my $o_xnto=     undef;          # check_nt output
sub p_version { print "check_ntutil_wrapper version : $Version\n"; }
sub print_usage {
    print "Usage: $0 -H <host> -w <warn level> -c <crit level>\n";
}
sub isnnum { # Return true if arg is not a number
  my $num = shift;
  if ( $num =~ /^(?:\d+\.?\d*|\.\d+)$/ ) { return 0 ;}  # ^ and $ now anchor both alternatives
  return 1;
}
sub help {
   print "Utilisation Exceeded warning wrapper script for Nagios check_nt script, version ",$Version,"\n";
   print_usage();
   print <<EOT;
-h, --help
   print this help message
-H, --hostname=HOST
   name or IP address of host to check
-w, --warn=INTEGER
   warning level for uptime in days
-c, --crit=INTEGER
   critical level for uptime in days
-V, --version
   prints version number
EOT
}
sub check_options {
    Getopt::Long::Configure ("bundling");
    GetOptions(
        'h'     => \$o_help,            'help'          => \$o_help,
        'H:s'   => \$o_host,            'hostname:s'    => \$o_host,
        'V'     => \$o_version,         'version'       => \$o_version,
        'w:s'   => \$o_warn,            'warn:s'        => \$o_warn,
        'c:s'   => \$o_crit,            'critical:s'    => \$o_crit,
        );
    if (defined ($o_help) ) { help(); exit $ERRORS{"UNKNOWN"}};
    if (defined($o_version)) { p_version(); exit $ERRORS{"UNKNOWN"}};
    if (! defined($o_host) ) # check host and filter 
        { print_usage(); exit $ERRORS{"UNKNOWN"}}
    if (!defined($o_warn))
        { print "Input warning threshold\n"; print_usage(); exit $ERRORS{"UNKNOWN"}}
    if (!defined($o_crit))
        { print "Input critical threshold\n"; print_usage(); exit $ERRORS{"UNKNOWN"}}
    if ( isnnum($o_warn) || isnnum($o_crit)) 
        { print "Numeric values only for warning or critical !\n";print_usage(); exit $ERRORS{"UNKNOWN"}}
    if ($o_warn >= $o_crit)
        { print "Surely your warning value should be less than the critical :?:\n";print_usage(); exit $ERRORS{"UNKNOWN"}}
}

check_options();
@o_xnti= `/usr/local/nagios/libexec/check_nt -v UPTIME -H $o_host`;
foreach $xntline (@o_xnti) {
        $_=$xntline;
        chomp $_;
        if ($_=~/(System Uptime \- )(.*)( day)(.*)/) {
                $o_xnto=$2;
        }
}
if (!defined($o_xnto)) {
        print "Unable to extract Uptime";
        exit $ERRORS{"UNKNOWN"};
}
elsif (($o_xnto > $o_warn) && ($o_xnto <= $o_crit)) {
        print "Uptime warn threshold exceeded";
        exit $ERRORS{"WARNING"};
}
elsif ($o_xnto > $o_crit) {
        print "Uptime crit threshold exceeded";
        exit $ERRORS{"CRITICAL"};
}
elsif ($o_xnto <= $o_warn) {
        print "OK - Uptime:$o_xnto days";
        exit $ERRORS{"OK"};
}
else {
        print "Uptime defined but unable to evaluate";
        exit $ERRORS{"UNKNOWN"};
}

Goldie · February 15, 2008, 12:50pm

Hello,

Looks great, but I still have some problems.
I placed check_ntutil_wrapper in my nagios/libexec dir, chmod 755 like the others, then changed the owner to the right user.
But I had some problems adding it to command.cfg: suddenly my pre-flight check says it cannot find nagios.cfg in libexec. That is not where it is located, so the message itself makes sense, but I don’t know why it suddenly goes looking for the config file over there…

This is what I added in command.cfg, but it does not seem to work:

# 'check_ntutil_wrapper' command definition
define command{
        command_name    check_ntutilwrapper
        command_line    $USER1$/check_ntutil_wrapper -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -s $ARG3$
        }

You already made the dark sky grey, and the blue is almost coming out!
Thx

Goldie · February 15, 2008, 1:03pm

After some adjusting I now get the following error:

Error: Service check command ‘check_ntutil_wrapper’ specified in service ‘Uptime_Wrapper’ for host ‘C2MMAIN1’ not defined anywhere!

Grtz

Strides · February 15, 2008, 1:10pm

Hi Goldie
Looks like you missed an “_” in the command.cfg, currently defined as:

command_name check_ntutilwrapper

should be:

command_name check_ntutil_wrapper

(or you could take the ‘missing’ underscore out of your service check…)
HTH
/S

Edit: oh, and you don’t need the -s $ARG3$ in there either…
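Putting both corrections together, the definition would presumably end up looking something like this (a sketch only, not tested against a live Nagios config):

```
# 'check_ntutil_wrapper' command definition
define command{
        command_name    check_ntutil_wrapper
        command_line    $USER1$/check_ntutil_wrapper -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$
        }
```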

Goldie · February 18, 2008, 8:24am

Thanx Strides,

The service appears now; unfortunately it gives an error in my Nagios dashboard, saying:

Uptime_Wrapper

UNKNOWN 	02-18-2008 09:17:33 	2d 17h 28m 37s 	3/3 	Numeric values only for warning or critical !

When I check the plugin I can see it is a warning that you defined for the case
( isnnum($o_warn) || isnnum($o_crit)), but I do not know what the two pipes mean. (Damn, I feel so dumb when it comes to programming.)

Grtz

Strides · February 18, 2008, 10:12am

Hi Goldie

Ahhh, you can’t tell me you are not having fun

The double pipes represent a logical OR, so it’s saying that if either of the supplied values for the warning or critical thresholds is not a number, then exit with the “unknown” code. How have you written the service check? You don’t need the -w and -c in the service check. If your command_line in command.cfg still looks something like:

command_line $USER1$/check_ntutil_wrapper -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$

Then the check_command would be looking something like

check_command           check_ntutil_wrapper!85!95

where in the above 85 would be your warning threshold and 95 your critical threshold and get passed through as $ARG1$ and $ARG2$.
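To make the argument passing concrete, here is a small Perl model of it. It is only a simplified imitation of what Nagios does (splitting check_command on "!" and filling in $ARGn$); `expand_command`, `not_a_number` and `thresholds_invalid` are names made up for this sketch, and `$USER1$` and `$HOSTADDRESS$` are deliberately left unexpanded. It needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

# Simplified model (an assumption, not Nagios source): split check_command
# on "!" and substitute the pieces for $ARG1$, $ARG2$, ... in command_line.
sub expand_command {
    my ($command_line, $check_command) = @_;
    my ($name, @args) = split /!/, $check_command;
    $command_line =~ s{\$ARG(\d+)\$}{ $args[$1 - 1] // '' }ge;
    return $command_line;
}

# Same idea as the wrapper's isnnum(): true when the value is not a number.
sub not_a_number {
    my ($value) = @_;
    return 1 unless defined $value;
    return $value =~ /^(?:\d+\.?\d*|\.\d+)$/ ? 0 : 1;
}

# || is a logical OR: one bad threshold is enough to fail the check.
sub thresholds_invalid {
    my ($warn, $crit) = @_;
    return (not_a_number($warn) || not_a_number($crit)) ? 1 : 0;
}

my $template = '$USER1$/check_ntutil_wrapper -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$';
print expand_command($template, 'check_ntutil_wrapper!85!95'), "\n";
print expand_command($template, 'check_ntutil_wrapper! -w 85 -c 95'), "\n";
print thresholds_invalid('85', '95') ? "rejected\n" : "accepted\n";
print thresholds_invalid('85', '')   ? "rejected\n" : "accepted\n";
```

With a check_command like `check_ntutil_wrapper! -w 85 -c 95`, everything after the "!" lands in $ARG1$ and $ARG2$ stays empty, so the expanded line has a doubled -w and a dangling -c. The wrapper then presumably sees an empty or non-numeric threshold.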

HTH

/S

Goldie · February 18, 2008, 11:01am

Yo Strides,

Been checking plugins and looking inside them just to get a better view, but instead of understanding more I feel like going bananas

OK, I understand now: I wrote it like check_ntutil_wrapper! -w 85 -c 95, and that’s why it was complaining about numeric values? -c and -w are not numbers.
We changed it as you suggested; now Nagios says:
Uptime_Wrapper

UNKNOWN 	02-18-2008 11:51:10 	2d 20h 7m 56s 	3/3 	Unable to extract Uptime

:s
I feel tired

Grtz, and once again thanks for your patience and much-appreciated help!

Strides · February 18, 2008, 11:30am

Hi Goldie

No problem, it’s a welcome diversion from my, er, day job… 8)

Yeah, that numerical check is a bit of code from the source script I adapted. It doesn’t really need to be in there at all, I guess, but now that it’s ‘working’ again it may as well stay. Curious, though, that it cannot parse the uptime out of the response from check_nt…

Can you post the output of

./check_nt -v UPTIME -H <ip_address>

and

perl check_ntutil_wrapper -H <ip_address> -w 85 -c 95

(run in libexec or add full path as appropriate)

It was always going to be tricky to get this working without any ‘teething problems’ and no test server at my end, but I reckon we are close…

Cheers

/S

Goldie · February 18, 2008, 11:44am

HI,

Already tried the first command you asked me to do, here is the output:
administrator@s-lam-01:/usr/local/share/nagios/libexec$ sudo ./check_nt -v UPTIME -H 172.20.2.17
[sudo] password for administrator:
Connection refused
could not fetch information from server

And the second one:

administrator@s-lam-01:/usr/local/share/nagios/libexec$ sudo perl check_ntutil_wrapper -H 172.20.2.17 -w 85 -c 95
Unable to extract Uptime

Strides · February 18, 2008, 12:02pm

Ah, you’re not getting any response from check_nt then… that’s a bit of a problem. We need to get the uptime bit working before the wrapper script has anything to evaluate. I was under the impression that your uptime checks were working alright, or are they working OK in Nagios by some other magic?
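One thing that would have shortened the hunt: the wrapper throws away whatever check_nt printed when no uptime is found. Below is a sketch of keeping that text, assuming a Unix-like box and Perl 5.8 or later for the list form of a piped open. `run_plugin` and `describe_failure` are names invented here, and the demo command is just perl itself pretending to be a check_nt that cannot connect.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without going through the shell and
# return its exit code plus everything it printed on stdout.
sub run_plugin {
    my (@cmd) = @_;
    open(my $fh, '-|', @cmd) or return (3, "cannot run $cmd[0]: $!");
    my @lines = <$fh>;
    close($fh);                      # sets $? to the child's wait status
    my $code = $? >> 8;
    chomp @lines;
    return ($code, join(' / ', @lines));
}

# Hypothetical helper: build an UNKNOWN message that keeps the plugin text.
sub describe_failure {
    my ($code, $output) = @_;
    return "UNKNOWN - no uptime in plugin output (exit $code): $output";
}

# Demo: perl stands in for a check_nt that cannot reach the agent.
my ($code, $output) =
    run_plugin($^X, '-e', 'print "Connection refused\n"; exit 2');
print describe_failure($code, $output), "\n";
# prints "UNKNOWN - no uptime in plugin output (exit 2): Connection refused"
```

Passing the command as a list also means the host address never goes through a shell, which the backtick call in the posted script does not guarantee.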

Cheers

/S

Goldie · February 18, 2008, 12:06pm

Hi,

All the uptime checks for my servers work all right. Even the time for the multiple-core servers is OK. I have an Excel file with the last reboots in it, and the uptimes match.
So check_nt!UPTIME works like a charm!

Grtz

Goldie · February 18, 2008, 12:09pm

# 'check_nt' command definition
define command{
        command_name    check_nt
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$
        }

is defined in command.cfg, and

# Create a service for monitoring the uptime of the server
# Change the host_name to match the name of the host you defined above
define service{
        use                     generic-service
        host_name               SW2K3DC1
        service_description     Uptime
        check_command           check_nt!UPTIME
        }

is an example of a working UPTIME service:

Uptime

OK 	02-18-2008 13:04:56 	3d 21h 44m 2s 	1/3 	System Uptime - 84 day(s) 2 hour(s) 8 minute(s)   RESULT

Grtz!

Strides · February 18, 2008, 1:07pm

ah ok then, that’s making a bunch of sense and it’s easily fixable…

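I suspect

./check_nt -H 172.20.2.17 -p 12489 -v UPTIME

will yield better results from the command line… if so, edit check_ntutil_wrapper and change:

@o_xnti= `/usr/local/nagios/libexec/check_nt -v UPTIME -H $o_host`;

so it reads

@o_xnti= `/usr/local/nagios/libexec/check_nt -H $o_host -p 12489 -v UPTIME`;

and give

perl check_ntutil_wrapper -H 172.20.2.17 -w 85 -c 95

another try. If that works, happy days.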
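Rather than hard-coding a port in the script, the wrapper could take it as an option. This is a sketch, assuming Getopt::Long 2.36 or later (for GetOptionsFromArray). `parse_args` and `check_nt_command` are invented names, the -p flag is an addition that the posted wrapper does not have, and the 12489 default is simply the port from the check_nt command definition posted above.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

Getopt::Long::Configure('no_ignore_case');   # keep -H distinct from any -h

# Hypothetical helper: parse wrapper arguments, with a default port.
sub parse_args {
    my (@argv) = @_;
    my %opt = (port => 12489);
    GetOptionsFromArray(
        \@argv,
        'H|hostname=s' => \$opt{host},
        'p|port=i'     => \$opt{port},
        'w|warn=i'     => \$opt{warn},
        'c|crit=i'     => \$opt{crit},
    ) or return;
    return \%opt;
}

# Build the argument list for check_nt; a list (not a string) so that it
# can be handed to a shell-free open() or system().
sub check_nt_command {
    my ($opt, $plugin_dir) = @_;
    return ("$plugin_dir/check_nt",
            '-H', $opt->{host}, '-p', $opt->{port}, '-v', 'UPTIME');
}

my $opt = parse_args(qw(-H 172.20.2.17 -w 85 -c 95));
print join(' ', check_nt_command($opt, '/usr/local/nagios/libexec')), "\n";
# prints "/usr/local/nagios/libexec/check_nt -H 172.20.2.17 -p 12489 -v UPTIME"
```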

I’m keeping my fingers crossed

Cheers

/S

Goldie · February 18, 2008, 1:19pm

Damn Strides,

I should have come up with this myself: the port!

If you had given me half an hour more… then maybe my heart would have stopped beating.

Will you apply to be the next Nobel Prize winner?

Works!

There is another one I would like to configure, concerning open handles on an Oracle 10g DB, but I will try to make it work without bothering you this time.
You were such a great help!

What is your bank account number?

Grtz!