Hi,
I’ve newly installed Nagios and am getting intermittent false service alarms (example shown below).
I’m only monitoring network devices, so for every host I have created a service to ping the device.
Any ideas why I’m getting these false alarms?
Many thanks,
John
[09-23-2005 08:44:39] SERVICE ALERT: ird-me04-if-sat.sta;check-host-alive;CRITICAL;HARD;1;PING CRITICAL - Packet loss = 5%, RTA = 259.24 ms
define command{
command_name check-host-alive
command_line /usr/local/nagios/libexec/check_ping -H $HOSTADDRESS$ -n 10 -w 440,15% -c 500,25%
}
define service{
host_name *
service_description check-host-alive
check_command check-host-alive
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,c,r
contact_groups all
}
A few things I don’t like.
What is -n? I don’t see a -n option when I run ./check_ping --help.
Your command definition does not use the $USER1$ macro. Not that this is a problem, but it’s there to save you some typing. When you have over 1000 services defined, this will become important.
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
The service check command you are using is the one meant for checking a host. Remember, a ping to one device may have to be “tweaked” in comparison to another device, but you have no way to do that, since you have no “$ARG1$” macros defined in your service check. In other words, please use the check_ping command definition for a service, or better yet, check_fping.
# 'check_fping' command definition
define command{
command_name check_fping
command_line $USER1$/check_fping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -n 2
}
But I see no reason why your check is going critical, other than the fact that there is no “-n” switch.
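As a rough illustration of why the logged values look inconsistent with a CRITICAL result, here is a minimal shell sketch of how a warning/critical pair such as -w 440,15% -c 500,25% is usually evaluated. The function name classify_ping is hypothetical, and the >= comparison is an assumption about the plugin's behavior rather than something taken from its source.

```shell
# classify_ping RTA_MS LOSS_PCT WARN_RTA WARN_LOSS CRIT_RTA CRIT_LOSS
# Hypothetical helper: prints OK, WARNING or CRITICAL for one ping result.
# Assumption: a threshold is breached when the measured value is >= the limit.
classify_ping() {
    awk -v rta="$1" -v loss="$2" \
        -v wrta="$3" -v wpl="$4" \
        -v crta="$5" -v cpl="$6" 'BEGIN {
        # "+ 0" forces numeric comparison of the arguments
        if (rta + 0 >= crta + 0 || loss + 0 >= cpl + 0)
            print "CRITICAL"
        else if (rta + 0 >= wrta + 0 || loss + 0 >= wpl + 0)
            print "WARNING"
        else
            print "OK"
    }'
}

# Values from the alert above, with the thresholds from the command definition
classify_ping 259.24 5 440 15 500 25   # prints OK
```

Under these assumptions, 5% loss and an RTA of 259.24 ms stay below both threshold pairs, which suggests the CRITICAL state comes from somewhere other than the numbers in the alert text.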
Many thanks for this.
I’ve changed to use the standard command definition for check_ping. I got the -n muddled up with -p. So far, so good.
The issue I’ve got is that I’m deploying this in an ISP network. For 99% of hosts there are no services to monitor, just connectivity. In this scenario, would I use a ping command for both the host and the service?
Use the standard check-host-alive for hosts, and check_fping for services,
or check_ping, if you don’t have check_fping.
But tweak check_ping to get the best performance for you.
To understand what I mean by tweaking, see…
nagios.sourceforge.net/docs/1_0/tuning.html and look at the section:
Optimize host check commands.
That should have said optimize service check commands.
Thanks again, I’ll try out fping. I’m still getting the same problem with check_ping. I’m not getting host alerts, just service alerts.
I have tried -p with both 1 and 10 but it doesn’t make any difference. Do you think it’s the plugin or Nagios itself that is misinterpreting the data?
[09-23-2005 16:18:28] SERVICE ALERT: ird-asia05-if-sat.sta;check-host-alive;CRITICAL;HARD;1;PING CRITICAL - Packet loss = 5%, RTA = 262.28 ms
define command{
command_name check-host-alive
command_line /usr/local/nagios/libexec/check_ping -H $HOSTADDRESS$ -w 5000,100% -c 5000,100% -p 10
}
Are you making these changes but not restarting Nagios? If so, you can’t do that. Whenever you make ANY change, you have to restart Nagios.
Yes I have been restarting each time.
Please install fping and then compile the check_fping plugin. Use that instead of check_ping.
Thanks again. I’ve installed fping and check_fping but I am still getting the same problem. One thing I’ve noticed is that the plugin is returning strange results. I’ve changed the command temporarily to log to a file:
define command{
command_name check-fping
command_line /usr/local/nagios/libexec/check_fping -H $HOSTADDRESS$ -w 5000,100% -c 5000,100% > /usr/local/nagios/var/fping-$HOSTADDRESS$
}
And when I start nagios like this I see the following in the file:
srv-ftp01# cat fping-210.5.196.19
FPING WARNING - 210.5.196.19
but if I run the command from the command line I get:
srv-ftp01# /usr/local/nagios/libexec/check_fping -H 210.5.196.19 -w 5000,100% -c 5000,100%
FPING OK - 210.5.196.19 (loss=0%, rta=251.000000 ms)|loss=0%;100;100;0;100 rta=0.251000s;5.000000;5.000000;0.000000
This is happening on all services.
Is it normal that I get different results when Nagios runs check_fping? Thanks.
You are most likely running the command from a shell as root, whereas under Nagios it runs as user nagios. Check the permissions on the fping file and also all of the plugins.
which fping
then ls -la that file and check permissions.
You also may have some strange timeout settings in your nagios.cfg file.
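When comparing a manual run with what Nagios sees, it can help to look at the plugin's exit status as well as its text, since Nagios derives the service state from the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch; the helper names state_name and run_check are hypothetical.

```shell
# state_name CODE: map a plugin exit code to the Nagios state name
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "UNDEFINED($1)" ;;
    esac
}

# run_check CMD [ARGS...]: run a plugin, then print its state and its output
run_check() {
    output=$("$@" 2>&1)   # capture stdout and stderr of the plugin
    status=$?             # exit status of the plugin itself
    printf '%s -> %s\n' "$(state_name "$status")" "$output"
}

# Example (path and address as used in this thread; run it once as root
# and once as the nagios user, e.g. after "su - nagios"):
# run_check /usr/local/nagios/libexec/check_fping -H 210.5.196.19 -w 5000,100% -c 5000,100%
```

If the two runs report different states, the difference lies in the environment of the user running the check rather than in Nagios itself.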
Mmmm there’s the problem:
srv-ftp01# /usr/local/nagios/libexec/check_fping -H 210.5.196.19 -w 5000,100% -c 5000,100%
FPING OK - 210.5.196.19 (loss=0%, rta=251.000000 ms)|loss=0%;100;100;0;100 rta=0.251000s;5.000000;5.000000;0.000000
srv-ftp01# su - nagios
-su-3.00$ /usr/local/nagios/libexec/check_fping -H 210.5.196.19 -w 5000,100% -c 5000,100%
FPING WARNING - 210.5.196.19
-su-3.00$ ls -al /usr/local/nagios/libexec/check_fping
-rwxr-xr-x 1 nagios wheel 157988 Sep 24 08:02 /usr/local/nagios/libexec/check_fping
I’ll have a dig around
-su-3.00$ fping 210.5.196.19
This program can only be run by root, or it must be setuid root.
Permissions on the libexec/check_fping are wrong.
Also, run which fping, then ls -la on that file to get its permissions.
For example:
which fping
/usr/local/sbin/fping
ls -la /usr/local/sbin/fping
-rwsr-xr-x 1 root root 42379 Jun 9 09:40 /usr/local/sbin/fping
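The check described above (an "s" in the owner permissions and root as the owner) can also be scripted. This is a minimal sketch; has_setuid and owner_uid are hypothetical helper names, and the fping path is whatever which fping reports on your system.

```shell
# has_setuid FILE: succeed if FILE exists and has the setuid bit set
has_setuid() {
    [ -u "$1" ]
}

# owner_uid FILE: print the numeric user ID that owns FILE
owner_uid() {
    ls -ln "$1" | awk '{ print $3 }'
}

# Example usage (commented out because the path differs per system):
# f=$(which fping)
# if has_setuid "$f" && [ "$(owner_uid "$f")" = "0" ]; then
#     echo "$f is setuid root"
# else
#     echo "$f is NOT setuid root"
# fi
```

If the file turns out not to be setuid root, running chown root and then chmod u+s on it as root would produce the -rwsr-xr-x mode shown in the listing above.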
Hi, it needed setuid root. Schoolboy error. Cooking with gas now…thanks for the help!