I’ve been experiencing an issue where nagios stops processing external commands after it has been running for a while. When I restart nagios, it immediately accepts external commands again. I tail the nagios.cmd file and see the entries being put in there by my various scripts, and I tail /var/spool/nagios/nagios.log and see that nagios is processing the entries (EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT -> SERVICE ALERT -> SERVICE NOTIFICATION). However, after some amount of time (I haven’t figured out exactly how long, but it seems to be as little as half an hour), nagios stops processing my external commands.
The entries are still being placed into the nagios.cmd file, and nagios appears to be reading the file and removing the entries from it as usual, but nothing appears in nagios.log and no notifications are sent. I then try to send a bunch of external commands through (all PROCESS_SERVICE_CHECK_RESULT, fwiw), and none of them results in a nagios.log entry or an alert being sent, even though they’re all Critical, no volatile checks (snmp traps).
In any case, restarting nagios fixes the problem, but that’s not a fix at all.
Has anyone experienced something like this? Any clues as to where I might look to resolve this?
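For anyone trying to reproduce this, here is a minimal sketch of how a script might submit one of these passive results. The line layout is the standard PROCESS_SERVICE_CHECK_RESULT external command format; the command file path, host name, and service name are assumptions for illustration, not taken from the setup described above.

```python
import time

def format_check_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line in external command format."""
    ts = int(time.time() if now is None else now)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}\n"

def submit(line, cmd_file="/var/spool/nagios/nagios.cmd"):  # path is an assumption
    # The command file is a named pipe, so this open() blocks until
    # nagios (the reader) has the other end open.
    with open(cmd_file, "w") as pipe:
        pipe.write(line)

# Hypothetical usage: report a Critical (return code 2) trap result.
# submit(format_check_result("router1", "snmp_trap", 2, "link down"))
```

Sending one known line like this and then watching nagios.log for the matching EXTERNAL COMMAND entry makes it easier to tell exactly when processing stops.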
The server load is as close to nothing as possible. It’s a new installation, with only two devices checking one or two services each. It’s mainly a test system to streamline our snmp trap checking process.
As to your second question, whether nagios is still working: that’s what I’d like to know how to determine. The process is still running, certainly, and there’s no process locked up in the Scheduling Queue. What might I do to determine if the entire nagios process is locking up? Is there some other log file I might look at that would list the most recently processed task, or the status of that task?
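One rough way to see when nagios last handled an external command is to scan nagios.log for the newest EXTERNAL COMMAND entry. The sketch below assumes the usual "[epoch] message" log line layout; the function name and the usage shown in the comments are only illustrative.

```python
import re

# Log lines are assumed to look like: [1200000000] EXTERNAL COMMAND: ...
LOG_LINE = re.compile(r"^\[(\d+)\] (.*)$")

def last_seen(lines, marker="EXTERNAL COMMAND:"):
    """Return the epoch timestamp of the newest line whose message starts with marker, or None."""
    newest = None
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group(2).startswith(marker):
            ts = int(m.group(1))
            if newest is None or ts > newest:
                newest = ts
    return newest

# Hypothetical usage against the log mentioned above:
# import time
# with open("/var/spool/nagios/nagios.log") as f:
#     ts = last_seen(f)
# print("no entries" if ts is None else f"{time.time() - ts:.0f}s since last external command")
```

If that age keeps growing while scripts are still writing to nagios.cmd, it at least confirms the stall and narrows down when it began.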
You say that you only have 2 devices (hosts) and 1 or 2 service checks each. If that is the case, then why is there ANYTHING at all being written to the external command file?
Now, if you are saying that after a while you attempt to force an immediate service check, and that at first it works, but then after an hour it won’t, then that is different.
It’s quite possible that you have a bad config setting or, as luca stated, it’s hung up on a normal service check.
vi nagios.cfg
check_external_commands=1
command_check_interval=-1 #notice that it is set to MINUS 1
service_check_timeout=60
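If you want to check those three settings without opening the file by hand, something like the following could do it. This is only a sketch: it assumes plain key=value lines with comments on their own lines starting with "#", and the WANTED values are simply the ones listed above.

```python
# Values suggested above; adjust to taste.
WANTED = {
    "check_external_commands": "1",
    "command_check_interval": "-1",
    "service_check_timeout": "60",
}

def read_settings(lines):
    """Collect key=value pairs, skipping blank lines and full-line '#' comments."""
    settings = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()  # last occurrence wins
    return settings

def mismatches(settings, wanted=WANTED):
    """Return {key: (found, wanted)} for every setting that differs or is missing."""
    return {k: (settings.get(k), v) for k, v in wanted.items() if settings.get(k) != v}

# Hypothetical usage (path to nagios.cfg is an assumption):
# with open("/etc/nagios/nagios.cfg") as f:
#     print(mismatches(read_settings(f)))
```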
To find a service check that is hogging the system or taking a long time to execute, look at each of the services in the web and look at:
Check Duration: < 1 second
You may have one that says 120 seconds or some such thing.
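With more than a handful of services, clicking through the web interface gets tedious, so here is a sketch that looks for slow checks in the status file instead. It assumes the status file is made of "servicestatus { ... }" blocks (older versions may use "service { ... }") containing host_name, service_description, and check_execution_time in seconds; verify that against your own status file before trusting it. The threshold and path are arbitrary choices.

```python
def slow_services(lines, threshold=10.0):
    """Return (host, service, seconds) for service blocks at or above threshold."""
    slow, current, in_service = [], {}, False
    for raw in lines:
        line = raw.strip()
        if line.endswith("{"):
            # Start of a block; only service status blocks are of interest.
            in_service = line.split()[0] in ("servicestatus", "service")
            current = {}
        elif line == "}":
            if in_service:
                secs = float(current.get("check_execution_time", 0))
                if secs >= threshold:
                    slow.append((current.get("host_name"),
                                 current.get("service_description"), secs))
            in_service = False
        elif in_service and "=" in line:
            key, value = line.split("=", 1)
            current[key] = value
    return slow

# Hypothetical usage (status file path is an assumption):
# with open("/var/spool/nagios/status.dat") as f:
#     for host, service, secs in slow_services(f):
#         print(f"{host}/{service}: {secs:.1f}s")
```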