My Nagios installation has stopped checking services! I think host checking has also stopped, as I haven’t received any notification e-mails about down hosts for about a week!
I have to manually schedule a service check in order for it to occur.
Service checks are enabled on the web interface. Duration is listed as ??? on each service.
I am using Nagios 3.0.3 on Ubuntu Server, with Nuvola Style theme. I haven’t made any config changes that would cause such a thing to happen!
Is this behaviour constant?
Or does it just happen from time to time?
If it only happens sometimes … well, I can’t help you much. I have the same problem with a few of my Nagios servers: sometimes (once or twice per month) they just stop doing their checks. Everything looks normal (processes present, web interface up and running), except that Nagios is doing nothing.
=> the “solution”: we had to implement a “Heartbeat” test which is sent (by SNMP) to Netcool, which raises an alert if 2 heartbeats are absent…
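For anyone who wants to try the same idea, here is a rough sketch of the "alert after 2 missing heartbeats" rule in shell. The function names, the epoch-second arguments and the interval value are assumptions made up for illustration; the setup described above sends its heartbeat by SNMP to Netcool, and this sketch does not reproduce that part.

```shell
#!/bin/sh
# Hypothetical helper: count how many heartbeats have been missed.
# $1 = current time, $2 = time of the last heartbeat (both epoch seconds),
# $3 = expected interval between heartbeats in seconds.
missed_heartbeats() {
    now=$1
    last=$2
    interval=$3
    age=$((now - last))
    # A heartbeat stamped in the future counts as "just received".
    if [ "$age" -lt 0 ]; then
        age=0
    fi
    echo $((age / interval))
}

# Succeeds (exit 0) when at least two heartbeats are missing.
should_alert() {
    [ "$(missed_heartbeats "$1" "$2" "$3")" -ge 2 ]
}
```

It could be called from cron along the lines of `should_alert "$(date +%s)" "$last_seen" 120 && send_alert`, where `last_seen` and `send_alert` are placeholders for however you record and report heartbeats.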
Sorry, I’ve never seen this problem…
maybe someone else will come up with a solution.
In the meantime, I don’t have many ideas, apart from restarting Nagios, re-installing Nagios, or hitting your server box until it starts working again.
I just ran into this over the weekend. The only difference is that I did not have ??? as the duration. It appeared as if Nagios had just quit checking things. The scheduling queue showed checks scheduled for the day before, but none of them were being run. I restarted Nagios (and the server itself), but didn’t have any luck. I ended up disabling all active checks for my “All” hostgroup and then re-enabling them, which kicked everything off again. Still trying to find the root cause of this…
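The same disable/re-enable toggle can probably be scripted through the Nagios external command file instead of clicking through the web interface. A minimal sketch follows. It assumes external commands are enabled, and that you pass in whatever `command_file` is set to in your nagios.cfg (the path is not known from this thread). The function names are made up. `DISABLE_HOSTGROUP_SVC_CHECKS` and `ENABLE_HOSTGROUP_SVC_CHECKS` are standard Nagios external commands. There are matching `..._HOSTGROUP_HOST_CHECKS` commands if host checks need the same treatment.

```shell
#!/bin/sh
# Print the two external commands that disable and then re-enable active
# service checks for one hostgroup.
# $1 = hostgroup name, $2 = timestamp in epoch seconds.
recheck_commands() {
    printf '[%s] DISABLE_HOSTGROUP_SVC_CHECKS;%s\n' "$2" "$1"
    printf '[%s] ENABLE_HOSTGROUP_SVC_CHECKS;%s\n' "$2" "$1"
}

# Write those commands to the Nagios command file.
# $1 = hostgroup name, $2 = path to the command file (assumption: taken
# from the command_file setting in your own nagios.cfg).
toggle_hostgroup_checks() {
    hostgroup=$1
    cmdfile=$2
    recheck_commands "$hostgroup" "$(date +%s)" > "$cmdfile"
}
```

For the case above the call would be something like `toggle_hostgroup_checks All /path/to/nagios.cmd`. Whether this clears the stall as reliably as the manual toggle did is untested.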
I had this problem with the 3.0.b4 beta, but not so far with the 3.0.3 release. What I found was that Nagios would get stuck doing a check_snmp and hang. I also found that the killproc_nagios in /etc/init/nagios did not reliably kill off the nagios process. My response was to add the following to /etc/init/nagios:
kill_nagios ()
{
# Force-kill every process owned by the nagios user
kill -9 $(ps -U nagios -o "pid=")
}
and to modify ‘stop’ to include a call to that:
stop)
echo -n "Stopping nagios: "
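A gentler variant, not what was done above, would be to send SIGTERM first and only fall back to SIGKILL after a grace period, so that Nagios has a chance to write out its retention data. This is a sketch only: `stop_pids`, `still_running` and the grace period are invented names and values, not part of any Nagios init script.

```shell
#!/bin/sh
# Succeeds if at least one of the given PIDs still exists.
still_running() {
    for pid in "$@"; do
        # Signal 0 sends nothing; it only tests whether the process exists.
        kill -0 "$pid" 2>/dev/null && return 0
    done
    return 1
}

# Hypothetical helper: stop the given PIDs, politely first.
# $1 = grace period in seconds, remaining arguments = PIDs.
stop_pids() {
    grace=$1
    shift
    [ "$#" -gt 0 ] || return 0      # nothing to stop
    kill -TERM "$@" 2>/dev/null
    while [ "$grace" -gt 0 ]; do
        still_running "$@" || return 0
        sleep 1
        grace=$((grace - 1))
    done
    # Grace period is over: force-kill whatever is left.
    kill -KILL "$@" 2>/dev/null
    return 0
}
```

Inside the init script it could be called as `stop_pids 10 $(ps -U nagios -o "pid=")`, reusing the same ps invocation as kill_nagios.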
After some more digging, I remembered that I reboot my server each Friday night during our scheduled network maintenance window. During this window, I don’t check anything except for a few services on the Nagios server itself.
So, I’m only checking a few services, then I reboot and afterwards, the local services still get checked, but nothing else does. I ended up changing the reboot of the server to the Friday lunch hour so I could keep an eye on it and now everything seems to be ok. Not sure why that combination of things caused this problem.