Nagios Freezing


#1

Hi,

I’ve set up a few Nagios boxes, but this one really takes the cake. It’s a Fedora Core 4 box running Apache 2.0.55 and Nagios 1.3.

It had been running fine for a few days, then all of a sudden it froze up. Even after a reboot it won’t start checking hosts; it just sits there doing nothing.

I’m really at a loss with this. I had a Nagios 1.2 box that has been up and running for about a year with no problems, but this one won’t work. I’ve tried version 2.0, but that won’t work at all. I’m wondering if it has to do with FC4, as I used FC3 before.

If I grep the processes, this is what I get:

nagios 2369 0.0 0.1 2604 1588 ? Ss 11:02 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 2758 0.0 0.1 2608 1588 ? S 11:07 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 2759 0.0 0.0 3784 688 ? S 11:07 0:00 /usr/local/nagios/libexec/check_ping -H 10.32.44.48 -w 3000.0,80% -c 5000.0,100% -p 15 -t 20
nagios 2760 0.0 0.0 1728 536 ? S 11:07 0:00 /bin/ping -n -U -w 90 -c 15 10.32.44.48
root 2767 0.0 0.0 3756 692 pts/0 S+ 11:08 0:00 grep nagios

There seems to be something very wrong, but it’s a complete mystery what.
Can anyone help?

Thanks.


#2

Hm…rather interesting…how do you know that Nagios isn’t checking anything? Looks to me that, from the output of your process grep, Nagios is checking to see that a host is up…

I’m pretty sure that the issue isn’t with FC4…it might be with the way it was installed, but I’ve got FC3 and FC4 boxes running, both with Nagios on them, and things are running smoothly.


#3

Possibly not enough disk space? It’s the only thing I remember that hangs Nagios and doesn’t get solved by a reboot…

Luca


#4

Thanks guys, I’ve checked the disk space and it looks fine.

I investigated further and the checks were happening, just very, very slowly. When two hosts went down, it appeared to get stuck on them.

/usr/local/nagios/libexec/check_ping -H 10.32.44.48 -w 3000.0,80% -c 5000.0,100% -p 15 -t 20

I think the slow part was where it checks a host using the check-host-alive command. It was taking ages, so I changed it to this:

‘check-host-alive’ command definition

define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }

Now it only pings 5 times rather than 15, so it’s not waiting for 15 ICMP packets. The 15-packet setting worked fine on the old box, but on this one it didn’t.

It’s running okay now, but I’m not sure if this is what fixed it or not. It’s completing all 313 checks in 5 minutes or less, which is good. Before, it wasn’t even doing half that in an hour!
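To put rough numbers on that, here is a quick Python sketch of the throughput before and after the change. It uses only the figures above (313 checks, 5 minutes, fewer than half the checks in an hour). The function name `checks_per_second` is made up for illustration, and "half in an hour" is treated as an upper bound on the old rate, so the real speed-up may well have been larger.

```python
def checks_per_second(checks: float, seconds: float) -> float:
    """Average check throughput over a time window."""
    return checks / seconds


TOTAL_CHECKS = 313

# After the change: all checks finish within 5 minutes (a lower bound on the rate).
after = checks_per_second(TOTAL_CHECKS, 5 * 60)

# Before the change: fewer than half the checks in an hour (an upper bound on the rate).
before_upper_bound = checks_per_second(TOTAL_CHECKS / 2, 60 * 60)

# Ratio of the two bounds, so the true speed-up is at least this much.
speedup_at_least = after / before_upper_bound

if __name__ == "__main__":
    print(f"after:    at least {after:.2f} checks/sec")
    print(f"before:   under {before_upper_bound:.3f} checks/sec")
    print(f"speed-up: at least {speedup_at_least:.0f}x")
```

So the box went from well under one check every 20 seconds to about one check per second.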

Could someone post their checkcommands.cfg file so I can have a look at the check_ping
and check-host-alive definitions and see what you have used for them?
I think this might be where the problem lies.

Thanks guys.


#5

15 is way too much.
I’d use 3… not more…

You have a warning at 80% packet loss… that’s already worth a critical in my opinion. I’d use 3 packets with warning at 40% (2 packets lost) and set retry_interval to 1 with 3 retries. This would cause a notification to be sent out if 3 checks fail… that means in 2 minutes at least 6 packets out of 9 got lost…
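The arithmetic behind that can be sketched in Python. This assumes check_ping changes state when packet loss is at or above the threshold percentage, and that the three attempts run one minute apart. The helper name `min_lost_packets` is just for illustration.

```python
def min_lost_packets(packets: int, threshold_pct: int) -> int:
    """Smallest number of lost packets whose loss percentage reaches the threshold."""
    # Integer ceiling of packets * threshold_pct / 100, avoiding float rounding.
    return -(-packets * threshold_pct // 100)


PACKETS_PER_CHECK = 3
WARNING_PCT = 40
ATTEMPTS = 3            # checks that must fail before a notification goes out
RETRY_INTERVAL_MIN = 1  # minutes between attempts (assumed from retry_interval 1)

lost_per_check = min_lost_packets(PACKETS_PER_CHECK, WARNING_PCT)
total_sent = PACKETS_PER_CHECK * ATTEMPTS
total_lost = lost_per_check * ATTEMPTS
elapsed_min = (ATTEMPTS - 1) * RETRY_INTERVAL_MIN

if __name__ == "__main__":
    print(f"{lost_per_check} of {PACKETS_PER_CHECK} packets lost per failing check")
    print(f"at least {total_lost} of {total_sent} lost over {elapsed_min} minutes")
```

For comparison, `min_lost_packets(15, 80)` gives 12, so with the original -p 15 and 80% setting a host had to drop 12 of 15 packets before it even reached a warning.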

Hope it is clear enough :)

Luca


#6

define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 2
        }

Yes, -p 15 was excessive.
To see why I changed to -p 2, read the tips here:
nagios.sourceforge.net/docs/1_0/tuning.html
Item #7 is the one in particular.


#7

Thanks guys, it seems to be behaving itself. I’m using 3 packets because almost all the checks are going via a VPN across the net to reach these hosts. There can be some internet flappage, so I’ve used slightly higher tolerances so Nagios doesn’t cry wolf too often.

It all seems to be okay for the moment though.

Check Execution Time: 0 / 10 / 0.640 sec
Check Latency: 0 / 2 / 0.108 sec

Active Checks: 314

Passive Checks: 0

So it’s looking good.

Thanks