Nagios acts up after a phase of high load


Hi Guys,

We’ve been running Nagios for a few weeks without any major
hassles, but in the last 5 hours it got its knickers in a twist twice.

When the server it is running on had a pretty high load for a few
minutes, it fired alerts that pings timed out or that the SMTP agent
couldn’t run within the given time. However, those notifications
didn’t go away after the load went back to normal (they stay
for hours if I let them). The only way to get the status display
clean again is to stop and start Nagios (which isn’t really that
great a solution). Is that behaviour normal? Am I just expecting
too much? Or is there a configuration key I need to modify
to have it tidy up after itself?



When the crap hits the fan, Nagios can really get backed up, and I’m not sure how to deal with it. I’m running over 1000 checks every 5 minutes, and the CPU load is very high. So when one check fails, there may be 300 other checks that fail along with it. I have parent hosts defined, but am not using host/service dependencies. I found that host/service dependencies take too long to figure out where the problem is. So a service fails, then I get tons of “unreachable” notifications, and things just get real ugly.
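
For anyone following along, this is roughly the kind of object definition I mean by parent hosts and service dependencies. It is only a sketch: the host names (core-switch, mailhost) and service descriptions (PING, SMTP) are made up for illustration and are not from my real config, and the failure criteria shown are just one plausible choice.

```cfg
# Hypothetical example: names below are placeholders, not a real setup.

# Parent relationship: mailhost sits behind core-switch, so if the switch
# is down Nagios can mark mailhost UNREACHABLE instead of DOWN.
define host {
    host_name   mailhost
    parents     core-switch
    # ... remaining required host directives omitted ...
}

# Service dependency: SMTP on mailhost depends on PING on mailhost.
define servicedependency {
    host_name                       mailhost
    service_description             PING
    dependent_host_name             mailhost
    dependent_service_description   SMTP
    # Skip running the dependent check while PING is CRITICAL or UNKNOWN.
    execution_failure_criteria      c,u
    # Suppress notifications for SMTP while PING is WARNING, CRITICAL or UNKNOWN.
    notification_failure_criteria   w,c,u
}
```

The idea would be that one failing PING check suppresses the pile of dependent alerts, but as said above, I have not had good results with this at my check volume.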

Bottom line, I need more help on this subject too.