Nagios grinds to a halt and performs no more service checks

eoindickson · October 2, 2008, 6:47pm

I have a Nagios 3.0.3 server monitoring 200 hosts and 800 services.
It WAS working fine, but since adding some new hosts I have a strange problem which I’ve been unable to fix by restarting or even removing the hosts I’d added. I’ve searched the forum for days and tried tinkering with several config file options but with no success.
What is happening is that when I start nagios all looks normal and it starts checking hosts and services. But it will only get so far, maybe checking 30% of hosts and 10% of services, and then it seems like no more checks are done; all the remaining hosts/services show as PENDING. Every time I restart, it stops processing new checks at a different point. If I look at the service checks that are PENDING they say “Service check scheduled for” and then a date/time that is already past. From then on no new service checks are done. Like I said, if I restart Nagios it starts checking again and then grinds to a halt. Nagios is still running, it just seems to decide to do no more checks.
I’d really appreciate any advice, I’m really stuck with this.
Thanks,
Eoin

eoindickson · October 2, 2008, 9:49pm

One more thing - I’ve just noticed the “Scheduling Queue” display, and if I look at that there are hundreds of checks, all saying ENABLED, but all with a “Next Check” date/time in the past.
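
The overdue entries described above can also be counted outside the web interface. What follows is only a rough sketch, assuming status.dat uses the Nagios 3 layout of `servicestatus { key=value ... }` blocks with `next_check` stored as epoch seconds. The function name `overdue_checks` and the sample data are made up for illustration.

```python
import re
import time

def overdue_checks(status_text, now=None):
    """Return (host, service, next_check) for service checks scheduled in the past."""
    now = time.time() if now is None else now
    overdue = []
    # Assumption: each service is stored as a "servicestatus { ... }" block
    # whose closing brace sits on its own line.
    for block in re.findall(r"servicestatus\s*\{(.*?)\n\s*\}", status_text, re.S):
        fields = dict(
            line.strip().split("=", 1)
            for line in block.splitlines()
            if "=" in line
        )
        next_check = int(fields.get("next_check", 0))
        # next_check=0 means nothing is scheduled, so it is not counted as overdue.
        if 0 < next_check < now:
            overdue.append(
                (fields.get("host_name"), fields.get("service_description"), next_check)
            )
    return overdue

# Hypothetical sample data, not taken from a real server.
SAMPLE = """
servicestatus {
    host_name=web1
    service_description=PING
    next_check=1000
    }

servicestatus {
    host_name=web2
    service_description=PING
    next_check=3000
    }
"""

if __name__ == "__main__":
    for host, service, when in overdue_checks(SAMPLE, now=2000):
        print(host, service, when)
```

Against a real server the text would come from reading the status.dat file itself. A count that keeps growing after startup would match the symptom of checks sitting in the queue without being run.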

Strides · October 3, 2008, 8:52am

I’d suggest stopping nagios and removing (or renaming/moving) any remaining copies of status.dat or retention.dat from nagios/var/, and also bin anything left hanging about in nagios/var/spool/checkresults/. You’ll lose all your comments and any scheduled downtime and whatnot, but you should get it to restart afresh.

Also check that the date/time configured on your nagios server is right. I had similar weird issues with a 20 min time drift that ntp seemed to miss, until all of a sudden it decided to correct it and nagios fell over almost immediately. Weird, but at least a restart fixed it in my case…
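
The clean-up above could be scripted roughly as follows. This is a sketch only: it assumes Nagios has already been stopped, that the state files live directly under the var directory as described, and that moving them to a timestamped backup folder is acceptable. The function name `reset_nagios_state` and the backup location are hypothetical.

```python
import shutil
import time
from pathlib import Path

def reset_nagios_state(var_dir, backup_root):
    """Move status.dat/retention.dat aside and empty spool/checkresults.

    Run this only while Nagios is stopped. Returns (backup_dir, moved, removed).
    """
    var_dir = Path(var_dir)
    backup = Path(backup_root) / time.strftime("nagios-state-%Y%m%d-%H%M%S")
    backup.mkdir(parents=True)

    moved = []
    for name in ("status.dat", "retention.dat"):
        src = var_dir / name
        if src.exists():
            # Moving rather than deleting keeps comments and downtime recoverable.
            shutil.move(str(src), str(backup / name))
            moved.append(name)

    removed = 0
    spool = var_dir / "spool" / "checkresults"
    if spool.is_dir():
        for leftover in spool.iterdir():
            if leftover.is_file():
                leftover.unlink()  # stale check result files are simply discarded
                removed += 1
    return backup, moved, removed
```

A call might look like `reset_nagios_state("nagios/var", "/tmp")`, with both paths adjusted to the actual installation.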

HTH

/S

eoindickson · October 3, 2008, 4:20pm

Thanks for your suggestions !
I’d already seen a post suggesting to remove status.dat and retention.dat but I did that again anyway. I hadn’t thought about the time issue; in fact ntp wasn’t configured. I have configured it now but that doesn’t seem to have made any difference either.
I started Nagios at 9.10 and if I look at the Scheduling Queue there are hundreds of checks there with times ranging from 9.06 to 9.17 - it’s now 9.18. These checks will now just sit in the queue and never get performed.

Any other ideas ?

Taius · October 30, 2008, 7:16pm

I too have this issue. It happens with both 3.0.3 and 3.0.4. I have nsca running as a daemon (it dies every couple of minutes and I restart it with cron), along with PNP and ndo2db.

It’s very weird. After a restart, all checks are run once, then never again even though they are scheduled. My server receives most data from passive checks and it does receive them; I can see it in the log. If you look at status detail, last update is recent, but last check is way off.

I’ve been trying to figure this out for a couple weeks now. No headway.

Taius · November 4, 2008, 1:36am

Today I think I found the problem: it’s ndo2db. If I do a stop and start on that service I get another round of service checks. So I guess I need to restart this service every minute?

eoindickson · November 4, 2008, 11:21pm

Update - I got so fed up I completely reinstalled. The server then worked OK for a week or so and then the same thing happened. Eventually I narrowed it down to my having max_check_attempts for my hosts set to 5. When I set this to 1 and restarted, everything went back to normal. I’m not sure exactly why this worked but everything has been fine since then.
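
To see which hosts carry which max_check_attempts value before changing anything, something like the following may help. It is a minimal sketch, assuming the usual `define host{ ... }` object syntax with one directive per line. It does not follow `use` template inheritance, and the function name `host_max_check_attempts` and the sample config are invented for illustration.

```python
import re

def host_max_check_attempts(config_text):
    """Map host_name (or template name) to its max_check_attempts value."""
    result = {}
    # Matches "define host{" but not "define hostgroup{" and similar.
    for body in re.findall(r"define\s+host\s*\{(.*?)\}", config_text, re.S):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                directives[parts[0]] = parts[1].strip()
        name = directives.get("host_name") or directives.get("name")
        if name and "max_check_attempts" in directives:
            result[name] = int(directives["max_check_attempts"])
    return result

# Hypothetical config text, not taken from the thread.
SAMPLE_CFG = """
define host{
    host_name           web1
    address             192.0.2.10
    max_check_attempts  5       ; value that caused trouble in this thread
    }

define hostgroup{
    hostgroup_name      web
    members             web1,web2
    }

define host{
    host_name           web2
    address             192.0.2.11
    max_check_attempts  1
    }
"""

if __name__ == "__main__":
    for host, attempts in sorted(host_max_check_attempts(SAMPLE_CFG).items()):
        print(host, attempts)
```

Keep in mind that with max_check_attempts set to 1, a single failed host check puts the host straight into a hard DOWN state, so the change trades retries for whatever relief it gave here.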