I’m currently monitoring 458 servers running 6625 passive checks, which are executed every 5 minutes on the client side. I started receiving "WARNING: stale passive check result: something’s not right" after I added 100 hosts to Nagios.

My freshness_threshold was set to 360. Since I’m running checks every 5 minutes (300 seconds), that gives any late result 60 seconds of grace before it gets flagged. With this many servers and passive checks I had to increase the freshness_threshold to 1200; otherwise roughly 600 checks get flagged as stale.

I have tried everything I can think of but can’t find what’s causing the delay. I’m running Nagios on an 8-CPU box with 32 GB of RAM on a 100 Mbit/s connection. I’m not seeing any resource saturation, even on the 100 Mbit/s NIC, whose load doesn’t go above 40%. It seems to me that the Nagios daemon can’t process that much data, since most of the results come in at the same time.

Does anybody have any ideas on how I can resolve this issue? I don’t want to keep the freshness_threshold at 1200, since I could miss important alerts.
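
To put rough numbers on this, here is a small Python sketch of the arithmetic, using only the figures from my setup. The 10-second burst window is purely an assumption for illustration (I haven't measured how tightly the results actually cluster), and the function names are just ones I made up for this post.

```python
# Rough arithmetic for the passive-check load described above.
# Figures taken from the post; BURST_WINDOW_S is an assumed value, not measured.

CHECKS = 6625            # passive checks reporting in
INTERVAL_S = 300         # each check runs every 5 minutes
OLD_THRESHOLD_S = 360    # original freshness_threshold
NEW_THRESHOLD_S = 1200   # raised freshness_threshold
BURST_WINDOW_S = 10      # ASSUMPTION: results arrive within ~10 s of each other


def average_rate(checks: int, interval_s: float) -> float:
    """Results per second if arrivals were spread evenly over the interval."""
    return checks / interval_s


def burst_rate(checks: int, burst_window_s: float) -> float:
    """Results per second if all arrivals land inside one short window."""
    return checks / burst_window_s


def grace_seconds(threshold_s: int, interval_s: int) -> int:
    """How late a result may be before the freshness check flags it."""
    return threshold_s - interval_s


def missed_intervals(threshold_s: int, interval_s: int) -> float:
    """How many check intervals fit inside the freshness threshold."""
    return threshold_s / interval_s


if __name__ == "__main__":
    print(f"even spread : {average_rate(CHECKS, INTERVAL_S):.2f} results/s")
    print(f"burst (assumed {BURST_WINDOW_S} s): "
          f"{burst_rate(CHECKS, BURST_WINDOW_S):.1f} results/s")
    print(f"grace at 360 : {grace_seconds(OLD_THRESHOLD_S, INTERVAL_S)} s")
    print(f"grace at 1200: {grace_seconds(NEW_THRESHOLD_S, INTERVAL_S)} s")
    print(f"intervals covered by 1200: "
          f"{missed_intervals(NEW_THRESHOLD_S, INTERVAL_S):.0f}")
```

Spread evenly, the load is only about 22 results per second, which is why I suspect the problem is the bunching rather than the total volume. It also shows why I don't like the 1200 setting: a check can go silent for four full intervals before I hear about it.
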
Any help is very much appreciated. I’m pretty much stuck.