check_http timeout

Hi everyone.

We have been using Nagios for about a year now in our production environment, primarily to monitor URLs for our web services. One thing I have noticed is that we tend to get some false alarms, but in a weird fashion. We check each service every 2 minutes and alert after 5 criticals. On many occasions we get a failure alert, and then the check recovers on the 6th try. In a few instances I have seen DNS timeouts from our hosting provider, but it seems very odd that this happens so often. I would say 20% or so of our alerts are of this nature, and I’m not sure how to investigate. The service is up and running when I check it manually from another source. Could sockets be held open and choking the new requests? I was just wondering if anyone has experienced something similar.
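Since DNS timeouts are one suspect, one way to investigate might be to time the DNS lookup and the TCP connect separately from the Nagios host. The following is only a rough sketch, not part of Nagios or check_http; the function name time_phases and its parameters are made up for illustration.

```python
import socket
import time


def time_phases(host, port=80, timeout=10.0):
    """Return seconds spent in DNS lookup and in TCP connect, measured separately."""
    start = time.monotonic()
    # DNS lookup only; raises socket.gaierror if resolution fails.
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    dns_seconds = time.monotonic() - start

    # Use the first address the resolver returned.
    family, socktype, proto, _canonname, sockaddr = infos[0]

    start = time.monotonic()
    # Connect to the already-resolved address so DNS time is not counted twice.
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)
    connect_seconds = time.monotonic() - start

    return {"dns": dns_seconds, "connect": connect_seconds, "address": sockaddr[0]}
```

Calling time_phases with the monitored host name every minute or so and logging the result should show whether the slow periods line up with a large "dns" value or a large "connect" value. If neither is slow while the check is failing, the delay is more likely in the HTTP response itself.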

Check the load on the Nagios server. Maybe it is overloaded at certain times and the check_http plugin times out.
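To compare load against the alert times, something like the following could be run on the Nagios server (from cron, for example) with its output appended to a log file. This is a minimal sketch that assumes a Unix-like host; the function name load_snapshot is hypothetical.

```python
import os
import time


def load_snapshot():
    """Return one log line: local timestamp plus 1-, 5- and 15-minute load averages."""
    one, five, fifteen = os.getloadavg()  # available on Unix-like systems only
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} load1={one:.2f} load5={five:.2f} load15={fifteen:.2f}"
```

If the timestamps of the false alarms do not coincide with high load values in the log, server load can probably be ruled out.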

One thing I forgot to mention is that we have multiple web server instances running on one server, and we only get a delay for one site. When we have a load issue, I hear about it for all sites. It just doesn’t make sense that the first 5 checks fail but it always comes back on the 6th.