The scheduler stops executing checks. The issue is subtle, because all of the CGI web interface pages keep working normally (you can still browse the tactical overview, service detail, etc. without a problem). The problem only becomes apparent when you try to schedule a check.
It was the size of the service perfdata file. Log rotation is now set up, and it's running reliably again.
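For anyone who wants to do the same, here is a minimal Python sketch of size-based rotation for the perfdata file. The path and the 100 MB threshold are assumptions (point it at whatever service_perfdata_file is set to in your nagios.cfg), and rotate_if_large is a hypothetical helper, not part of Nagios or nagiosgraph. It copies the file to a gzip archive and then truncates it in place, the same idea as logrotate's copytruncate option, because Nagios keeps the file open and would simply carry on writing to a renamed copy.

```python
import gzip
import os
import shutil
import time
from typing import Optional

# Hypothetical location; use the value of service_perfdata_file from your nagios.cfg.
PERFDATA_FILE = "/usr/local/nagios/var/service-perfdata.dat"
# Assumed threshold: rotate once the file passes 100 MB.
MAX_BYTES = 100 * 1024 * 1024


def rotate_if_large(path: str = PERFDATA_FILE, max_bytes: int = MAX_BYTES) -> Optional[str]:
    """Archive and truncate `path` if it is larger than `max_bytes`.

    Returns the archive path, or None if no rotation was needed.
    """
    if not os.path.exists(path) or os.path.getsize(path) <= max_bytes:
        return None
    archive = "{}.{}.gz".format(path, time.strftime("%Y%m%d%H%M%S"))
    # Copy the current contents into a gzip-compressed archive.
    with open(path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Truncate in place (same inode) so the handle Nagios holds stays valid.
    # This assumes the file was opened in append mode; a non-append writer
    # would resume at its old offset and leave a sparse file.
    with open(path, "r+b") as f:
        f.truncate(0)
    return archive
```

Run it from cron every few minutes. Any lines written between the copy and the truncate are lost, so treat it as a safety net rather than a replacement for setting the perfdata file mode correctly.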
"service-perfdata.dat file size: 2.0 GB. Is this OK?"
No! You're using nagiosgraph with a map file, right? In your nagios.cfg you need to set service_perfdata_file_mode=w rather than =a. Nagiosgraph is trying to parse a 2 GB file on every service check, and as a result the Nagios process doesn't start any more service checks until it's done. Change the mode to “write”, or this problem will come back as soon as your perfdata file gets big again.
Note for others reading this thread: Nagios 2.2 or 2.3 and earlier had a bug that swapped append mode and write mode, so if yours is already set to write mode, switch it to append and see whether the perfdata file stays small.
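If you're not sure which mode your install is currently using, a small Python sketch like this can read it out of nagios.cfg. It assumes the simple key=value layout of the main config file with # comment lines; read_nagios_cfg and perfdata_mode are hypothetical helpers, and perfdata_mode returns 'unset' when the directive is missing rather than guessing your version's default.

```python
from typing import Dict


def read_nagios_cfg(path: str) -> Dict[str, str]:
    """Parse key=value directives from nagios.cfg, skipping blank and # comment lines.

    Repeated keys (e.g. cfg_file) keep only their last value, which is fine
    for single-valued directives such as service_perfdata_file_mode.
    """
    settings: Dict[str, str] = {}
    with open(path, encoding="utf-8", errors="replace") as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def perfdata_mode(path: str) -> str:
    """Return service_perfdata_file_mode from nagios.cfg, or 'unset' if absent."""
    return read_nagios_cfg(path).get("service_perfdata_file_mode", "unset")
```

For example, perfdata_mode("/usr/local/nagios/etc/nagios.cfg") (the path is an assumption; use your own) returns "a" or "w" if the option is set. Restart Nagios after changing the value so it reopens the file in the new mode.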
MP, thank you for your expertise on this. I hadn't set either option before, but I've now applied write mode and am testing the change. I'll post back with the results soon.