We have a problem that frequently pops up in our monitoring environment. Our monitoring server has adequate bandwidth, but we share the link with an occasional bandwidth hog. Sometimes they consume the bulk of our available bandwidth, leaving Nagios unable to reliably reach the systems it monitors. The result is an onslaught of hundreds or even thousands of pages telling everyone that systems and services are down… one page for each host and service that is monitored! Sometimes more if the bandwidth is pinched for a prolonged period. Legitimate pages get lost in the flood.
Is there a way to set a threshold in Nagios on the number of hosts or services returning “Time Out” messages? If 4 distinct sites are timing out at the same time, it should suppress notifications for all hosts and services and send a single page indicating a bandwidth issue instead. Is there any facility in Nagios that even slightly resembles what I’ve described? TIA!
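To make the idea concrete, here is a rough Python sketch of the logic I have in mind. It is not an existing Nagios feature. The function names, the (site, plugin_output) input format, and the way a timeout is recognized in the plugin output are all assumptions for illustration. The only Nagios-specific piece is the `DISABLE_NOTIFICATIONS` external command line, which would still have to be written to the Nagios command file by something else.

```python
import time
from typing import Iterable, List, Optional, Set, Tuple

SITE_THRESHOLD = 4  # "4 distinct sites" from the description above


def is_timeout(plugin_output: str) -> bool:
    # Assumption: a timed-out check says so somewhere in its plugin output.
    text = plugin_output.lower()
    return any(marker in text for marker in ("timed out", "time out", "timeout"))


def timed_out_sites(results: Iterable[Tuple[str, str]]) -> Set[str]:
    # results: hypothetical (site, plugin_output) pairs, one per host or service.
    # A set is used so that many failing checks at one site count only once.
    return {site for site, output in results if is_timeout(output)}


def plan_response(
    results: Iterable[Tuple[str, str]],
    threshold: int = SITE_THRESHOLD,
    now: Optional[float] = None,
) -> Tuple[List[str], Optional[str]]:
    """Return (external command lines, single page text or None)."""
    sites = timed_out_sites(results)
    if len(sites) < threshold:
        # Below the threshold: leave normal per-host/service paging alone.
        return [], None
    stamp = int(time.time() if now is None else now)
    # Nagios external command format: "[timestamp] COMMAND".
    commands = [f"[{stamp}] DISABLE_NOTIFICATIONS"]
    page = (
        f"Possible bandwidth issue: {len(sites)} sites timing out "
        f"({', '.join(sorted(sites))})"
    )
    return commands, page
```

Since notifications would be switched off at that point, the single page would have to go out through some path other than Nagios itself, and something would need to send `ENABLE_NOTIFICATIONS` once the link recovers.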