Services changing or completely disappearing?

Brian_D · June 26, 2008, 4:18pm

I have a weird problem that seems to have shown up after upgrading to Nagios 3.01. I have tried Mozilla and IE to confirm it is not a browser problem. On one of my Windows servers, I am monitoring 3 drives. 2 of the drives are having issues. They seem to randomly rotate between states or sometimes do not even show up. Sometimes it will detect them correctly and report them as it is supposed to. I can then click refresh and the status will say "(Return code of 139 is out of bounds)". What is weird about it is that the duration will be something like 9 days 3 hours, even though 5 seconds before that it said the status was fine. Other times I can click refresh and the 2 drives will not even show up in the services list.

I have checked the IP address on the host, as that was the cause of an earlier problem with these drives not being detected.

I have also tried changing my warning and critical values to 81 and 91. I always verify and restart after every change.

Any ideas on what could be causing this one?

define command{
    command_name    check_nt_disk
    command_line    $USER1$/check_nt -H $HOSTADDRESS$ -s password -p xxxxx -v USEDDISKSPACE -l $ARG1$ -w $ARG2$ -c $ARG3$
}

define service{
    use                     server-service
    host_name               rdydap1
    service_description     F:\ Disk space
    check_command           check_nt_disk!f!80!90
}

Thanks
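
For context on the "Return code of 139 is out of bounds" message: Nagios accepts only plugin return codes 0 through 3 (OK, WARNING, CRITICAL, UNKNOWN), and by common shell convention an exit status above 128 means the process was terminated by signal number (status - 128). Under that convention, 139 would correspond to signal 11, which is SIGSEGV on Linux, so the plugin may be crashing rather than reporting a state. The arithmetic can be sketched as below; `decode_rc` is a made-up helper name for illustration, not part of Nagios.

```shell
#!/bin/sh
# decode_rc is a hypothetical helper, not part of Nagios.
# It classifies an exit status the way described above.
decode_rc() {
    rc=$1
    if [ "$rc" -ge 0 ] && [ "$rc" -le 3 ]; then
        # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN are the valid plugin states.
        echo "valid plugin state $rc"
    elif [ "$rc" -gt 128 ]; then
        # Shell convention: 128 + N means "terminated by signal N".
        echo "killed by signal $((rc - 128))"
    else
        echo "out of bounds"
    fi
}

decode_rc 139   # prints "killed by signal 11"
```

Running the command line from the check_nt_disk definition by hand and then checking `$?` would show whether the plugin itself exits with 139 outside of Nagios.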

Brian_D · June 26, 2008, 4:35pm

Looking at it again, the changing services are not isolated to just those 2 drives. I can go to my tactical screen where I know I should have at least 2 services that are in critical states, but 0 are being shown. I click on refresh and they are there. Another refresh and they are gone.

Something is seriously messed up, but I have no idea what. It makes it very difficult to rely on a program that is acting this flaky. Should I downgrade back to Nagios 2.x, or can this be worked out?

Brian_D · June 26, 2008, 4:46pm

Doing some troubleshooting, I completely stopped the Nagios service. I still get different results every time I refresh the screen. This leads me to believe there is a problem with the log files, apache2 or the web browser.

I am tempted to clear the status.log and nagios.log, but I do not know what the results would be.

Strides · June 27, 2008, 11:11am

This is indicative of 2 Nagios processes running. Try stopping Nagios and then running "killall -9 nagios" a couple of times before restarting, to make sure there is no second process running.

HTH

/S
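
The check for a leftover daemon can be sketched as below. This is only a sketch: it assumes the daemon's process name is exactly `nagios`, that `pgrep` is installed, and that it is run after Nagios has been stopped; `count_procs` is a hypothetical helper name. While Nagios is running it forks short-lived children to execute checks, so a count above one at that point is not proof of a duplicate by itself.

```shell
#!/bin/sh
# count_procs is a hypothetical helper: it prints how many running
# processes have exactly the given name.
count_procs() {
    # pgrep -x matches the process name exactly; wc -l counts the PIDs.
    pgrep -x "$1" | wc -l | tr -d ' '
}

# Run this AFTER stopping the Nagios service.
n=$(count_procs nagios)
echo "nagios processes still running: $n"
if [ "$n" -gt 0 ]; then
    # Anything left after a stop is a stray daemon that can keep
    # rewriting the status data the web interface reads.
    echo "stray nagios process found; kill it before restarting"
fi
```

If the count is zero after a stop and the web interface still changes between refreshes, the duplicate-process explanation becomes less likely.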

Brian_D · June 27, 2008, 3:18pm

I do believe you are correct. While looking through the forums here yesterday, I started thinking the same thing, but I was unsure of how to stop the services. I rebooted the entire server and it seems to be working much better after that. I also see that I no longer have so many orphaned checks in the event logs.

I will keep an eye on it, but I think it is better now. Thank you for the help.