Hello,
OS: FreeBSD 6.1
Nagios: 2.5
I’m getting the error above rather often and I can’t find any information on it at all. It happens in blocks of time and seems to affect only a single host check or host notification at a time, so it looks less like an OS/resource issue and more like a plugin or Nagios issue. Check it out:
[1162635783] HOST ALERT: gridcontrol;DOWN;SOFT;1;CRITICAL - popen timeout received, but no child process
[1162635793] HOST ALERT: gridcontrol;DOWN;SOFT;2;CRITICAL - popen timeout received, but no child process
[1162635803] HOST ALERT: gridcontrol;DOWN;SOFT;3;CRITICAL - popen timeout received, but no child process
[1162635813] HOST ALERT: gridcontrol;DOWN;SOFT;4;CRITICAL - popen timeout received, but no child process
[1162635823] HOST ALERT: gridcontrol;DOWN;SOFT;5;CRITICAL - popen timeout received, but no child process
[1162635833] HOST ALERT: gridcontrol;DOWN;SOFT;6;CRITICAL - popen timeout received, but no child process
[1162635843] HOST ALERT: gridcontrol;DOWN;SOFT;7;CRITICAL - popen timeout received, but no child process
[1162635853] HOST ALERT: gridcontrol;DOWN;SOFT;8;CRITICAL - popen timeout received, but no child process
[1162635863] HOST ALERT: gridcontrol;DOWN;SOFT;9;CRITICAL - popen timeout received, but no child process
[1162635873] HOST ALERT: gridcontrol;DOWN;HARD;10;CRITICAL - popen timeout received, but no child process
[1162635873] HOST NOTIFICATION: jessed;gridcontrol;DOWN;host-notify-by-email;CRITICAL - popen timeout received, but no child process
[1162635873] SERVICE ALERT: gridcontrol;HTTPS_auth_gridcontrol;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[1162636010] HOST ALERT: gridcontrol;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.35 ms
[1162636010] HOST NOTIFICATION: jessed;gridcontrol;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 0.35 ms
[1162636018] SERVICE ALERT: gridcontrol;HTTPS_auth_gridcontrol;OK;HARD;1;HTTP OK HTTP/1.1 200 OK - 1995 bytes in 5.076 seconds
All the ‘popen’ issues seem to follow the same pattern. The plugin consistently fails to execute due to this issue, and notifications for the failure don’t fire off. When the notification finally does work, successful host checks are sure to follow. This is not limited to one particular host; it jumps around to different hosts. It usually affects only a single host at a time, but occasionally it affects multiple hosts. Since Nagios only reports state changes, I can’t tell whether there are any successful checks between the HARD DOWN and the HARD UP.
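To see how these failures cluster over time, the log lines can be grouped per host. What follows is only a rough Python sketch, assuming the log format shown in the excerpt above; the function names and the `max_gap` threshold of 60 seconds are hypothetical choices for illustration, not anything Nagios defines.

```python
import re
from datetime import datetime, timezone

# Matches lines such as:
# [1162635783] HOST ALERT: gridcontrol;DOWN;SOFT;1;CRITICAL - popen timeout ...
HOST_ALERT = re.compile(
    r"^\[(\d+)\] HOST ALERT: ([^;]+);([^;]+);([^;]+);(\d+);(.*)$"
)


def popen_bursts(lines, max_gap=60):
    """Group 'popen timeout' host alerts into per-host bursts.

    max_gap (seconds) is an assumed threshold: two alerts for the same
    host that are further apart than this start a new burst.
    Returns a list of (host, first_ts, last_ts, count), oldest first.
    """
    bursts = []
    open_bursts = {}  # host -> [first_ts, last_ts, count]
    for line in lines:
        m = HOST_ALERT.match(line.strip())
        if not m or "popen timeout" not in m.group(6):
            continue
        ts, host = int(m.group(1)), m.group(2)
        cur = open_bursts.get(host)
        if cur and ts - cur[1] <= max_gap:
            cur[1] = ts
            cur[2] += 1
        else:
            if cur:
                bursts.append((host, *cur))
            open_bursts[host] = [ts, ts, 1]
    for host, cur in open_bursts.items():
        bursts.append((host, *cur))
    return sorted(bursts, key=lambda b: b[1])


def fmt(ts):
    """Render a Nagios epoch timestamp as readable UTC time."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S UTC"
    )
```

Feeding it the whole nagios.log (for example `popen_bursts(open(path))`, where `path` is wherever the log lives on your system) and printing each burst with `fmt()` should show whether the failures for different hosts overlap or follow one another.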
This does not seem to be affecting my service checks, only the host checks. All of my hosts use the ‘check-host-alive’ check command, and I’m using the standard version of it (indentation changed by me for this post):
define command{
command_name check-host-alive
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 99,99% -c 100,100% -p 1
}
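One way to narrow this down is to run the same command line by hand, outside of Nagios, and see whether the plugin itself ever hangs. The following is a minimal Python sketch of that idea. The plugin directory and host address in the usage note are placeholders, and the helper names `expand` and `run_check` are made up for this example; only the command line itself comes from the definition above.

```python
import shlex
import subprocess

# The command_line from the check-host-alive definition above.
COMMAND_LINE = "$USER1$/check_ping -H $HOSTADDRESS$ -w 99,99% -c 100,100% -p 1"


def expand(command_line, macros):
    """Replace $NAME$ macros and split the result into an argv list."""
    for name, value in macros.items():
        command_line = command_line.replace(f"${name}$", value)
    return shlex.split(command_line)


def run_check(argv, timeout=10):
    """Run the plugin with a hard timeout (10 seconds is an assumed value).

    Returns (exit_code, first_output) or (None, reason) on timeout.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None, f"no answer within {timeout} seconds"
    return proc.returncode, proc.stdout.strip()
```

For example, `run_check(expand(COMMAND_LINE, {"USER1": "/usr/local/libexec/nagios", "HOSTADDRESS": "192.0.2.10"}))`, with the directory and address swapped for your own values. Running this in a loop against an affected host while the errors are occurring would show whether check_ping misbehaves on its own or only when started by Nagios.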
While troubleshooting this, I noticed that host check spreading doesn’t seem to be working. Judging by when DNS queries are made, the checks all seem to be kicked off at about the same time, roughly every two minutes. I believe I have host_inter_check_delay_method set correctly:
host_inter_check_delay_method=s
max_host_check_spread=30
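As a rough sanity check on the spreading, the observed check start times can be compared with what an even spread would look like. This sketch does not reproduce how the ‘s’ (smart) method actually computes its delay; it simply assumes the checks would be distributed evenly across the max_host_check_spread window, and the function names and the 0.5 tolerance are hypothetical.

```python
def even_spread_delay(num_hosts, max_spread_minutes=30):
    """Seconds between checks if num_hosts checks were spread evenly
    over the whole window (an assumption, not the Nagios formula)."""
    if num_hosts < 1:
        raise ValueError("num_hosts must be at least 1")
    return max_spread_minutes * 60 / num_hosts


def looks_clustered(start_times, expected_delay, tolerance=0.5):
    """True if more than half of the gaps between consecutive check
    starts are shorter than tolerance * expected_delay."""
    times = sorted(start_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    if not gaps:
        return False
    short = sum(1 for g in gaps if g < expected_delay * tolerance)
    return short > len(gaps) / 2
```

With 60 hosts and a 30-minute window, an even spread would mean one check about every 30 seconds. Start times taken from the DNS query timestamps could be passed to `looks_clustered()` to confirm that the checks really are bunching up rather than being spread out.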
Anyone have any ideas about this? I’ve searched around and cannot find that error message associated with Nagios anywhere.
thanks,
–jesse