Nagios acts up after a phase of high load


#1

Hi Guys,

We’ve been running Nagios for a few weeks without any major
hassles, but in the last 5 hours it got its knickers in a twist twice.

When the server it runs on had a pretty high load for a few
minutes, it fired alerts that pings had timed out or that the SMTP
agent couldn’t run in the given time … however, those notifications
didn’t go away after the load went back to normal (they stay
for hours if I let them). The only way to get the status display
clean again is to stop and start Nagios (which isn’t really that
great a solution). Is that behaviour normal? Am I just expecting
too much? Or is there a configuration key I need to modify
to have it tidy itself up?

Cheers,
Tink


#2

When the crap hits the fan, Nagios can really get backed up. I’m not sure how to deal with it. I’m running over 1000 checks every 5 minutes, and the CPU load is very high. So when one check fails, there may be 300 other checks that fail as well. I have parent hosts defined, but am not using host/service dependencies. I found that the host/service dependencies take too long to figure out where the problem is. Sooooo, a service fails, then I get tons of “unreachable” alerts, and things just get real ugly.
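
For reference, here is a minimal sketch of the two mechanisms mentioned above: a parent host relationship and a service dependency. The host names, addresses, service descriptions and the generic-host template are made up for illustration and are not taken from a real setup; only the directive names are standard Nagios object configuration.

```
# Hypothetical upstream device; hosts behind it list it as a parent.
define host {
    use         generic-host        ; assumed template name
    host_name   core-switch         ; hypothetical
    address     192.0.2.1           ; documentation-range address
}

# If web01 fails its check while core-switch is down,
# web01 is reported as UNREACHABLE rather than DOWN.
define host {
    use         generic-host
    host_name   web01               ; hypothetical
    address     192.0.2.10
    parents     core-switch
}

# Optional: skip the SMTP check on web01, and suppress its
# notifications, while PING on core-switch is CRITICAL or UNKNOWN.
define servicedependency {
    host_name                       core-switch
    service_description             PING
    dependent_host_name             web01
    dependent_service_description   SMTP
    execution_failure_criteria      c,u
    notification_failure_criteria   c,u
}
```

With only the parents line in place, hosts behind a failed parent still generate “unreachable” states; whether those turn into notifications depends on the notification options set for each host.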

Bottom line, I need more help on this subject too.