Plugin failure: popen timeout received, but no child process


#1

Hello,

OS: FreeBSD 6.1
Nagios: 2.5

I’m getting the error above rather often and I can’t find any information on it at all. It happens in blocks of time and seems to affect only a single host check or host notification alert at a time, which looks less like an OS/resources issue and more like a plugin or Nagios issue. Check it out:

[1162635783] HOST ALERT: gridcontrol;DOWN;SOFT;1;CRITICAL - popen timeout received, but no child process
[1162635793] HOST ALERT: gridcontrol;DOWN;SOFT;2;CRITICAL - popen timeout received, but no child process
[1162635803] HOST ALERT: gridcontrol;DOWN;SOFT;3;CRITICAL - popen timeout received, but no child process
[1162635813] HOST ALERT: gridcontrol;DOWN;SOFT;4;CRITICAL - popen timeout received, but no child process
[1162635823] HOST ALERT: gridcontrol;DOWN;SOFT;5;CRITICAL - popen timeout received, but no child process
[1162635833] HOST ALERT: gridcontrol;DOWN;SOFT;6;CRITICAL - popen timeout received, but no child process
[1162635843] HOST ALERT: gridcontrol;DOWN;SOFT;7;CRITICAL - popen timeout received, but no child process
[1162635853] HOST ALERT: gridcontrol;DOWN;SOFT;8;CRITICAL - popen timeout received, but no child process
[1162635863] HOST ALERT: gridcontrol;DOWN;SOFT;9;CRITICAL - popen timeout received, but no child process
[1162635873] HOST ALERT: gridcontrol;DOWN;HARD;10;CRITICAL - popen timeout received, but no child process
[1162635873] HOST NOTIFICATION: jessed;gridcontrol;DOWN;host-notify-by-email;CRITICAL - popen timeout received, but no child process
[1162635873] SERVICE ALERT: gridcontrol;HTTPS_auth_gridcontrol;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[1162636010] HOST ALERT: gridcontrol;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.35 ms
[1162636010] HOST NOTIFICATION: jessed;gridcontrol;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 0.35 ms
[1162636018] SERVICE ALERT: gridcontrol;HTTPS_auth_gridcontrol;OK;HARD;1;HTTP OK HTTP/1.1 200 OK - 1995 bytes in 5.076 seconds

All the ‘popen’ issues seem to follow the same pattern. The plugin consistently fails to execute due to this issue & notifications for the failure don’t fire off. When the notification finally does work, successful host checks are sure to follow. This is not limited to a single host; it seems to jump around to different hosts. It usually affects only a single host at a time, but occasionally it affects multiple hosts. Since Nagios only reports state changes I can’t tell if there are any successful checks during the time in between the HARD DOWN and the HARD UP.

This does not seem to be affecting my service checks, only the host checks. All of my hosts are using the ‘check-host-alive’ check command & I’m using the standard version of it: (indentation changed by me for this post)

define command{
command_name check-host-alive
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 99,99% -c 100,100% -p 1
}

I have noticed while troubleshooting this that the host check spreading doesn’t seem to be working. The checks all seem to be kicked off at about the same time, judging by when DNS queries are being made, about every two minutes from my observation. I believe I have host_inter_check_delay_method set correctly:

host_inter_check_delay_method=s
max_host_check_spread=30

Anyone have any ideas about this? I’ve searched around and cannot find that error message associated with Nagios anywhere.

thanks,
–jesse


#2

One more thing… I’ve confirmed that all the hosts were fine during these times. No issues at all according to both the host logs and the application logs for the time periods in which these Nagios errors are displayed.

–jesse


#3

Downgrade Nagios to 2.4 and try that.


#4

I’m using Nagios 2.4 under Debian/Etch - had the same problem yesterday. Just one host alerted all day with the same error message (funny thing: it was the host where Nagios is running). Still investigating…


#5

I wish I had additional insight. The Monday following this discovery I removed the Nagios binaries that were installed from the source on the Nagios site and reinstalled using the FreeBSD port in FreeBSD 6.1. That was Nov 6th. Since then I’ve received the ‘popen timeout received…’ error message three times in total. Two of those times were for a single host, but since it only occurred twice it didn’t mark the host hard down. The third time was yesterday, but with only a single occurrence we wouldn’t even have noticed it without my checking this morning.

An interesting point is that the FreeBSD port is Nagios 2.5, the same version I installed from source a few weeks ago when the issue was quite severe. I don’t know what difference, if any, the port would have from the source aside from install locations, etc…

–jesse


#6

I’ve also been having this issue with my install of nagios. We’re running Nagios 2.5 on FreeBSD 6.1. However, our install came from the ports tree.

Anyone have any insight as to what is causing this issue and any solutions?


#7

No real insight but I do have additional information.

The problem became _far_ less pronounced after switching the monitoring system to Ubuntu. The migration had nothing to do with this issue, but the issue is better as a result. Further, during the last instance of this occurring I was able to reproduce the issue by manually calling the plugin with the same options that Nagios calls it with. I didn't have time to pursue the issue at that time, but it seemed obvious to me that it was actually a plugin issue, not a Nagios issue. I have not had time to go through the code and find out exactly what circumstance causes that error message.

Knowing where the error was really coming from led me to the check_ping source code. I'm not really a C guy so figuring out what's going on isn't as easy as I'd like, but basically the check_ping process arranges for a SIGALRM to be delivered to itself after <timeout> seconds (defaults to 10; you can override it by calling check_ping with -t). When the signal arrives, the function 'popen_timeout_alarm_handler' is called; it kills the child process, OR prints the error message this thread is discussing if the 'child_process' variable is not set.
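
To make that handler logic concrete, here is a minimal Python sketch of the same idea. It is not the plugin's C code: 'run_check', 'CheckTimeout', the 'setup_delay' parameter and the 'KILLED_MSG' wording are all made up for illustration, and only the 'no child process' message text comes from this thread. It suggests how the error could appear: if the alarm goes off while the program is still busy with something before the child is spawned, the handler finds no child to kill.

```python
import signal
import subprocess
import sys
import time

NO_CHILD_MSG = "CRITICAL - popen timeout received, but no child process"
KILLED_MSG = "CRITICAL - timed out, child killed"  # placeholder wording, not taken from the plugin


class CheckTimeout(Exception):
    """Raised by the alarm handler; stands in for the plugin printing and exiting."""


def run_check(setup_delay, timeout, child_argv):
    """Arm a timer, spend setup_delay seconds on pre-spawn work, then run the child.

    Returns the message produced by the handler, or None if the child finished in time.
    """
    state = {"child": None}  # plays the role of the 'child_process' variable

    def popen_timeout_alarm_handler(signum, frame):
        # Kill the child if there is one; otherwise report that the
        # timeout arrived before any child existed.
        if state["child"] is None:
            raise CheckTimeout(NO_CHILD_MSG)
        state["child"].kill()
        raise CheckTimeout(KILLED_MSG)

    previous = signal.signal(signal.SIGALRM, popen_timeout_alarm_handler)
    signal.setitimer(signal.ITIMER_REAL, timeout)  # one-shot timer, like alarm()
    try:
        time.sleep(setup_delay)  # stand-in for anything slow that runs before the spawn
        state["child"] = subprocess.Popen(child_argv)
        state["child"].wait()
        return None
    except CheckTimeout as exc:
        return str(exc)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
        if state["child"] is not None:
            state["child"].wait()  # reap the child so no zombie is left behind


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    # Timer fires during the pre-spawn phase: there is no child yet.
    print(run_check(setup_delay=0.3, timeout=0.1, child_argv=sleeper))
    # Timer fires while the child is running: the child gets killed.
    print(run_check(setup_delay=0.0, timeout=0.5, child_argv=sleeper))
```

The sketch needs a Unix-like system (SIGALRM) and must run in the main thread. The first call is the interesting one: nothing is wrong with the child command itself, the timer simply expires before the child has been started.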

I was only able to reproduce it when I replicated exactly what Nagios was calling check_ping with. If I called check_ping with a different host IP it would work fine; switching back to the destination IP that was giving an error would result in the error occurring immediately, NOT after the timeout value that it’s supposed to occur after.

If I'm able to catch it again I'll be looking for other check_ping processes in the process list that were called for the host I'm getting error messages for. I'll also try to run an strace and possibly ptrace on 'check_ping' when it fails so I can provide the info to the nagios plugins developers. That's as far as I go however. I don't know what would cause this situation and hesitate to guess given my very limited ability with C.

thanks,
–jesse


#8

Try the check_icmp at nagiosexchange, instead of check_ping.


#9

You might try compiling check_ping with --without-ipv6; that solved a different problem I was experiencing with check_ping. Also have a read of this thread:

sourceforge.net/mailarchive/mes … d=14887381

“it had to do with the ipv6 host check, where it
would check whether or not the host is an IPv6 address, by calling a
getaddrinfo(), and that would take long enough that SIGALRM would be
tripped, and child_process was NULL at that point.”

-mike
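
If the getaddrinfo() explanation quoted above applies, one way to check it would be to time the lookup for the affected host and compare it with the plugin timeout. The following is only a rough Python sketch under that assumption: 'time_getaddrinfo' is a hypothetical helper, the 10 second figure is the check_ping default mentioned earlier in this thread, and Python's resolver call is not guaranteed to behave exactly like the one inside the plugin.

```python
import socket
import sys
import time

PLUGIN_TIMEOUT = 10  # seconds; the check_ping default mentioned earlier in the thread


def time_getaddrinfo(host, family=socket.AF_UNSPEC):
    """Return (elapsed_seconds, error) for a single getaddrinfo() call on host."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, None, family)
        error = None
    except socket.gaierror as exc:
        error = str(exc)  # a failed lookup is still worth timing
    return time.monotonic() - start, error


if __name__ == "__main__":
    # Pass the host that shows the error on the command line.
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        elapsed, error = time_getaddrinfo(host, family)
        note = " (lookup failed: %s)" % error if error else ""
        flag = "  <-- longer than the plugin timeout" if elapsed > PLUGIN_TIMEOUT else ""
        print("%s lookup for %s took %.3f s%s%s" % (label, host, elapsed, note, flag))
```

Running it against the failing host at the time the alerts appear, and again when things are quiet, should show whether name resolution is slow enough to matter.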