I am new to Nagios and am having a problem with the servers we are monitoring. We do not get notification mails or SMS because the attempt count for a service check never increments beyond 1/4. It stays stuck at 1/4 instead of going to 2/4, 3/4 and finally 4/4, so no notification mail or SMS is sent. This problem only occurs when the status information is “Plugin timed out while executing system call” or “Connection refused by host”.
What’s the retry interval set to?
Retry interval is set to 1
Whenever we come across this issue, the service check hits a hard state at 1/4, which should normally only happen at 4/4. I’ve just done a default configuration of Nagios; nothing complex.
Do the checks get rescheduled?
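For illustration, the soft/hard behaviour described above can be modelled roughly as follows. This is a simplified sketch, not Nagios's actual code, and the function name classify_failed_check is made up for this example. It assumes the rule from the Nagios state-types documentation that a non-OK service result is treated as a hard state straight away when the service's host is DOWN or UNREACHABLE, which would be one possible way to end up with a hard state at 1/4.

```python
def classify_failed_check(attempt, max_attempts, host_up):
    """Classify a non-OK service check result as "SOFT" or "HARD".

    attempt      -- number of the check that just failed (1-based)
    max_attempts -- the service's max_check_attempts value (4 in this thread)
    host_up      -- False if the host is DOWN or UNREACHABLE

    Hypothetical helper for illustration only; it models the documented
    state-type rules in a simplified way.
    """
    if not host_up:
        # Assumption: with the host down, the failure is hard immediately,
        # without waiting for the remaining retries.
        return "HARD"
    if attempt >= max_attempts:
        # All retries used up: the problem is confirmed.
        return "HARD"
    # Retries remain, so the problem is still only suspected.
    return "SOFT"


# Host up: attempts 1..3 are soft, attempt 4 is hard.
states_host_up = [classify_failed_check(n, 4, host_up=True) for n in range(1, 5)]

# Host down: the very first failed attempt is already hard.
state_host_down = classify_failed_check(1, 4, host_up=False)
```

If this is what is happening, the state of the host at the time of the failed service check would be the first thing to look at.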
Yes, the checks get re-scheduled but the number of attempts does not increment.
I also want to know how to execute service checks regardless of the host status. Normally service checks are disabled when the host goes down or is unreachable.
Is your problem happening on ALL service checks or only some particular ones?
What is the use of running a service check on an unreachable or down host?
I think it can’t be done without modifying the source code of the Nagios scheduler.
It’s happening on all service checks.
I know it’s stupid to execute service checks on a down or unreachable host, but my superiors want it that way, so please let me know if there is a way.
Explain to them that it’s simply STUPID: if the host is down, you are only hogging the Nagios machine with useless checks (which I expect they can see will have no effect whatsoever). Decrease the host’s check interval if you want (the host check does a check_ping, so you are already doing a service check of sorts, if that can be used as an excuse). As I said, I doubt it can be done without modifying the source of the scheduler.
Back to the plugins:
It looks like Nagios can’t get a valid result from those plugins, so there really is no CRITICAL state to be counted.
Try running the plugins from the command line as the nagios user and see if you can get a result. It’s probably some misconfiguration or a firewall problem (for the second type of error), i.e. you can’t run check_ssh if you don’t have SSH access through some firewall.
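A manual test like that could be scripted roughly as below. This is only a sketch: the helper names, the 10-second timeout and the plugin path in the usage note are assumptions and will differ per installation. It relies on the standard plugin convention that the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and that the first line of output is the status text.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
EXIT_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def state_from_exit_code(code):
    """Map a plugin exit code to a state name; anything else is UNKNOWN."""
    return EXIT_STATES.get(code, "UNKNOWN")


def run_plugin(argv, timeout=10):
    """Run a plugin command line and return (state, first line of output).

    The timeout value is an arbitrary choice for this sketch; a plugin that
    exceeds it is reported here as "TIMEOUT" so it stands out from a real
    plugin result.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return ("TIMEOUT", "no result within %s seconds" % timeout)
    lines = proc.stdout.strip().splitlines()
    first_line = lines[0] if lines else ""
    return (state_from_exit_code(proc.returncode), first_line)


# Self-check with a stand-in "plugin" that prints a message and exits with 2.
fake_plugin = [sys.executable, "-c", "import sys; print('refused'); sys.exit(2)"]
fake_result = run_plugin(fake_plugin)
```

Run as the nagios user, a real call might look like run_plugin(["/usr/local/nagios/libexec/check_ssh", "myhost"]), where both the path and the host name are placeholders. If that returns a timeout or a connection error from the shell as well, the problem lies with the network or the plugin configuration rather than with the attempt counter.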