I need some help. We have been using Nagios 1.3 for quite a while and everything works pretty well. There is only one problem that seems impossible to fix. From time to time Nagios returns “Info: CRITICAL - Socket timeout after 40 seconds” for a service, e.g. for an HTTP check. When I then check the site in my browser, the site is there. The server is not overloaded either; everything is normal. A few minutes later Nagios reports OK again.
This happens every now and then for all of the checked services, without any regularity. The servers are also monitored with smokeping, and smokeping does not show anything strange. If I increase the timeout with -t, I get something like “Info: (No Information Returned From Host Check)”.
Are you monitoring the console, or is it sending emails?
On the console I had some wrong checks now and then, but since it checks 3 times before alerting, we never got emails…
Possibly it’s an error on the Nagios server… but usually it’s a one-time event…
We’ve got two Nagios servers: one that I set up on a box in the local office, and one that existed prior to my employment and sits on a machine about a thousand miles away (literally, it’s on the other side of the country).
Anywhos, this older Nagios box (running ver 2.0bv3, I believe) will often randomly show critical services with that error, while our local Nagios box reports no problems. We’ve discovered that the issue is the POS ISP that this box is running on. Not sure if that could be your problem, since 1.3 is a stable release and that’d be an error I’d expect from a beta version.
But to stop the notifications, I’d take Luca’s advice: increase the amount of retries before the notification is sent out.
Yes, we already have a max_check_attempts of 3 and a timeout of 40 seconds. Well, I will probably run a test and increase max_check_attempts even more.
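To get a feel for what raising max_check_attempts would change, here is a rough Python sketch. It is a simplified model, not Nagios code: it assumes a notification goes out only once a check has failed max_check_attempts times in a row, and that a single OK result resets the count. The function name first_hard_failure and the sample results are made up for illustration.

```python
def first_hard_failure(results, max_check_attempts=3):
    """Return the index of the check that would trigger a notification.

    `results` is a sequence of booleans (True = check returned OK).
    Simplified model: a problem only becomes "hard" (and notifies) after
    `max_check_attempts` consecutive non-OK results. Returns None if the
    sequence never reaches that point.
    """
    consecutive_failures = 0
    for index, ok in enumerate(results):
        if ok:
            # Any OK result resets the retry counter.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_check_attempts:
                return index
    return None


# Hypothetical history: isolated timeouts that recover on the next check.
flaky = [True, False, True, True, False, False, True]
print(first_hard_failure(flaky, max_check_attempts=3))
```

In this model, isolated timeouts that clear on the retry never notify at all, so a higher max_check_attempts only helps if the failures come in runs of three or more.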
3 checks and a 40 second timeout is a lot… even too much for an HTTP check…
The only thing I can recommend is trying to check the webserver by hand (from the same subnet where the Nagios server is located) and see if there is some transient network problem…
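One way to do that by-hand check over a longer period is a small script run from a machine in the same subnet as the Nagios server. The following is only a sketch using the Python standard library. The helper names (probe, summarize, watch), the 10 second timeout and the 30 second interval are assumptions for illustration, not anything Nagios itself uses.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Fetch `url` once and return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok, detail = True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        ok, detail = False, f"HTTP {exc.code}"
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # Covers DNS failures, refused connections and socket timeouts.
        ok, detail = False, f"error: {exc}"
    return ok, time.monotonic() - start, detail


def summarize(samples):
    """Summarize a list of (ok, elapsed_seconds, detail) tuples."""
    failures = [s for s in samples if not s[0]]
    slowest = max((s[1] for s in samples), default=0.0)
    return {"total": len(samples), "failures": len(failures), "slowest": slowest}


def watch(url, count=20, interval=30.0, timeout=10.0):
    """Probe `url` `count` times, printing each result with a timestamp."""
    samples = []
    for attempt in range(count):
        sample = probe(url, timeout=timeout)
        samples.append(sample)
        ok, elapsed, detail = sample
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} ok={ok} {elapsed:.2f}s {detail}")
        if attempt < count - 1:
            time.sleep(interval)
    return summarize(samples)
```

Calling watch("http://www.example.com/") (a placeholder URL) and comparing the timestamps of any failures with the times Nagios reported the socket timeout should show whether the problem is visible outside Nagios as well.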