"Service check did not exit properly"


#1

Near as I can tell, nagios logs a failed service check with “Service check did not exit properly” when it can’t execute the service check plugin. Kinda a poorly worded log message…

I keep getting this failure at random, about once a day, using check_snmp_if, and I can’t discern why.

I first thought it might be a timeout, but nagios logs those as timeouts.

I then thought it was a process limit, but I’m so under limits it hurts.

So I got cute and made a wrapper that logs when it starts, plus any output from check_snmp_if or the exact error when it can’t execute it. Welp, it happened again, and I don’t see anything in my logs to indicate what happened; all I see are valid runs.
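For anyone wanting to try the same thing, here is a minimal sketch of that kind of wrapper. It is not the actual in-house script; the function name `run_and_log`, the `EXEC_ERROR` label, and the choice to return UNKNOWN (3) when the plugin can't be started are all assumptions made for illustration.

```python
import subprocess
import sys
import time


def run_and_log(argv, log_path):
    """Run a plugin, append what happened to log_path, return its exit code."""
    stamp = time.strftime("%a %b %d %H:%M:%S %Y")
    with open(log_path, "a") as log:
        log.write("%s:\nCOMMAND: %s\n" % (stamp, " ".join(argv)))
        log.flush()  # record the start even if the plugin hangs afterwards
        try:
            # Merge stderr into stdout, like the "2>&1" on a shell command line.
            proc = subprocess.run(argv, stdout=subprocess.PIPE,
                                  stderr=subprocess.STDOUT, text=True)
        except OSError as exc:
            # The plugin could not be executed at all (missing file, no fork, ...).
            log.write("EXEC_ERROR: %s\n\n" % exc)
            print("UNKNOWN: cannot execute plugin: %s" % exc)
            return 3  # Nagios UNKNOWN (assumed choice for this sketch)
        log.write("OUTPUT: %s\nRETURN_STATUS: %d\n\n"
                  % (proc.stdout.strip(), proc.returncode))
    # Hand the plugin's output and exit code back to nagios unchanged.
    sys.stdout.write(proc.stdout)
    return proc.returncode
```

A wrapper script would end with something like `sys.exit(run_and_log(sys.argv[1:], LOG_PATH))`, where `LOG_PATH` is whatever file you pick, and the nagios command definition would point at the wrapper instead of the plugin.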

Anyone have further diag suggestions?


#2

Use check_snmp with the correct options rather than check_snmp_if; it’s faster. Make sure to specify the correct MIB with the -m option.
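As a rough illustration, the argument list for such a check could be built like this. The plugin path is an assumption (it reuses the directory from the check_snmp_if install), as are IF-MIB as the MIB to load, ifOperStatus as the object to query, and the `-r up` regex as the OK condition; adjust all of these to your setup.

```python
def check_snmp_argv(host, community, if_index,
                    plugin="/usr/local/libexec/nagios/check_snmp"):
    """Build a check_snmp command line for one interface's operational status."""
    return [
        plugin,
        "-H", host,                          # device to query
        "-C", community,                     # SNMP community string
        "-m", "IF-MIB",                      # MIB to load (assumed to be the right one)
        "-o", "ifOperStatus.%d" % if_index,  # object to fetch for this ifIndex
        "-r", "up",                          # OK only if the value matches this regex
    ]
```

Joining the result with spaces gives the line to put in a nagios command definition, or it can be passed directly to `subprocess.run` for a quick manual test.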


#3

Ok, that still doesn’t explain the error condition. My concern is something else is going on, and I’m just going to bump against it later if I swap plugins now.

Not to mention check_snmp just reports one stat (in this case OperStatus) vs check_snmp_if which also looks at AdminStatus when deciding to alert or not.
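To make the difference concrete, here is one plausible way to combine the two values. This is a guess at the kind of rule involved, not check_snmp_if's actual code; the function name and the decision to treat an administratively disabled interface as OK are assumptions.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def interface_state(admin, oper):
    """Return (exit_code, status_line) from AdminStatus and OperStatus strings."""
    line = "Admn:%s; Oper:%s;" % (admin, oper)
    if admin != "up":
        # Interface was shut down on purpose: assumed here not to be worth a page.
        return OK, "OK: " + line
    if oper == "up":
        return OK, "OK: " + line
    # Should be up according to AdminStatus, but is not passing traffic.
    return CRITICAL, "CRITICAL: " + line
```

With only OperStatus to go on, the first case would page even though someone disabled the port deliberately.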
[ Edited Fri Nov 04 2005, 10:56AM ]


#4

More data…

1cc-r2 atm 4/1/0.1-aal5 paged at 17:45:29 reporting “Service check did not exit properly”.

Looking at the logging from our in-house plugin wrapper, I have the following data:

Tue Nov 8 17:40:20 2005:
COMMAND: /usr/local/libexec/nagios/check_snmp_if_2 -H 192.168.1.252 -C public -i 17 2>&1
OUTPUT: OK: Admn:up; Oper:up;
RETURN_STATUS: 0

Tue Nov 8 17:45:20 2005:
COMMAND: /usr/local/libexec/nagios/check_snmp_if_2 -H 192.168.1.252 -C public -i 17 2>&1
OUTPUT: OK: Admn:up; Oper:up;
RETURN_STATUS: 0

Tue Nov 8 17:45:32 2005:
COMMAND: /usr/local/libexec/nagios/check_snmp_if_2 -H 192.168.1.252 -C public -i 17 2>&1
OUTPUT: OK: Admn:up; Oper:up;
RETURN_STATUS: 0

Tue Nov 8 17:50:32 2005:
COMMAND: /usr/local/libexec/nagios/check_snmp_if_2 -H 192.168.1.252 -C public -i 17 2>&1
OUTPUT: OK: Admn:up; Oper:up;
RETURN_STATUS: 0

So, the check did in fact exit cleanly with appropriate data.

It looks like nagios didn’t get the exit data, then spawned a new check right after (the 17:45:32 run), which generated a recovery page.
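Spotting these re-runs by eye gets tedious, so here is a small sketch that scans wrapper log lines for runs that start well before the normal interval has elapsed. It assumes the timestamp layout shown in the log above and a 5-minute check interval (inferred from the 17:40:20 and 17:45:20 runs); the function names and the 30-second slack are made up for illustration.

```python
from datetime import datetime

# Matches header lines such as "Tue Nov 8 17:45:20 2005:".
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y:"


def run_times(lines):
    """Collect the start time of every run found in the wrapper log lines."""
    times = []
    for line in lines:
        try:
            times.append(datetime.strptime(line.strip(), STAMP_FORMAT))
        except ValueError:
            continue  # COMMAND/OUTPUT/RETURN_STATUS or blank line
    return times


def early_runs(times, interval_seconds=300, slack_seconds=30):
    """Return (start_time, gap_in_seconds) for runs that began too soon."""
    early = []
    for prev, cur in zip(times, times[1:]):
        gap = (cur - prev).total_seconds()
        if gap < interval_seconds - slack_seconds:
            early.append((cur, gap))
    return early
```

Fed the four runs above, this flags only the 17:45:32 run, which started 12 seconds after the previous one. Matching those flagged times against the “did not exit properly” entries in the nagios log would show whether every failure is followed by an immediate re-check.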

nagios: 1.2
check_snmp_if: 0.3.5
FreeBSD: 4.11-p12

I’m now really stumped. At first I thought something was keeping the script from executing; now it appears the script is executing, but something is breaking the communication between nagios and the plugin.


#5

This has been resolved locally; see my post in meulie.net/forum_viewtopic.p … 472.0#4516