I’ve got Nagios 3.0.2 running on an old Dell PowerEdge 1300 server. It currently monitors almost 130 servers, mostly with HTTP checks that confirm the binds between my front-end and back-end servers are still up. We noticed this morning that the Nagios host was no longer issuing alerts when there was a problem. We tried sending mail manually from the command line on the Nagios box, and that did not work either.
Suspecting resource exhaustion on the server, I compared the running processes against my service configs and found that all of the 120+ services were scheduled to be checked every minute. I reconfigured Nagios to check the services every 10 minutes, and the server was immediately able to send mail and alerts again.
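
For a rough sense of scale, here is a back-of-the-envelope sketch in Python of how much the interval change reduces the check rate. The figure of 120 services is my approximate count, the function name is made up for illustration, and the model assumes one check per service per interval with no retries.

```python
def checks_per_minute(num_services: int, interval_minutes: float) -> float:
    """Average number of checks started per minute, assuming each
    service is checked exactly once per interval (retries ignored)."""
    return num_services / interval_minutes


# Approximate service count from my configs (assumption: about 120).
SERVICES = 120

before = checks_per_minute(SERVICES, 1)    # original 1-minute interval
after = checks_per_minute(SERVICES, 10)    # new 10-minute interval

print(f"1-minute interval:  {before:.0f} checks/minute")
print(f"10-minute interval: {after:.0f} checks/minute")
print(f"reduction factor:   {before / after:.0f}x")
```

So the change cut the average check rate by a factor of ten, which seems to be enough headroom for this box under normal conditions.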
Now, my question: is there a way to intentionally stagger the service checks, for example to have a service checked 2 minutes after the preceding one, or on the 10th minute of every hour? My worry is that in a total data center outage, Nagios will find all hosts and services down and start re-checking every one of them each minute (the check retry interval is currently set to 1 minute), which would likely load the server so heavily that it could no longer process e-mail and alerts. Aside from a big hardware upgrade (which is sorely needed), are there any suggestions?
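
To make concrete what I mean by staggering, here is a minimal Python sketch of the behavior I'm after. This is not Nagios code or configuration; `staggered_offsets` is a hypothetical helper that just spreads the start times of the checks evenly across one check interval instead of firing them all at once.

```python
def staggered_offsets(num_services: int, interval_seconds: float) -> list[float]:
    """Return a start offset (in seconds) for each service so that the
    checks are spaced evenly across one check interval."""
    if num_services <= 0:
        return []
    spacing = interval_seconds / num_services
    return [round(i * spacing, 2) for i in range(num_services)]


# Assumed numbers: about 120 services on a 10-minute (600 s) interval.
offsets = staggered_offsets(120, 600)

# Show when the first few checks would start within the interval.
for index, offset in enumerate(offsets[:5]):
    print(f"service {index}: starts {offset:.0f}s into the interval")
```

With those numbers, one check would start every 5 seconds rather than 120 checks starting together. What I don't know is how to get Nagios to schedule checks this way, or whether it would keep the spacing once everything goes into retry.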