My company uses Nagios plugins to monitor the availability of various target servers from multiple locations. We use check_http's 'eregi' option to search for specific strings in the target pages. Because our checks are frequent and sometimes run in large batches, we need the check_http process to perform as efficiently as possible; at peak times we run up to 1000 checks per minute.
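For reference, a typical check in our setup looks roughly like this (the hostname, URL path, and pattern below are placeholders, not our real targets):

    ./check_http -H www.example.com -u /status/page.html -R "service is up" -t 30

The -R (--eregi) option tells check_http to search the downloaded page for the given case-insensitive regular expression.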
The problem appears to be excessive memory usage by the check_http process when the target page is large. For example, if the page is larger than 500 KB, a single check_http process can occupy as much as 20% of memory on our 1 GB RAM servers. When several such checks run simultaneously, the server's memory is exhausted and it becomes very unstable. Usually it recovers after a while, but we have had situations where the server ended up unresponsive and we had to restart it.
We also noticed that the problem is connected with the time needed to download the target page: if either of the communicating servers is slow, the "memory" problem is much more likely to occur. For example, we tested against one large page hosted on a server in Italy. From high-bandwidth servers in Europe or the United States, where even large pages loaded quickly, we saw few "memory" problems, but from servers located in India or Australia we have constant problems.
To be sure the problem does not lie in our own software, we reproduced it by spawning 20 check_http processes directly, along the lines of the sketch below. The result was 100% memory usage by the check_http processes and a server that was unresponsive for 15 minutes.
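We launched the parallel checks with a simple shell loop roughly like the following (the hostname and path are placeholders; the actual test pointed at one of the large pages mentioned above):

    # spawn 20 concurrent check_http processes against a large page
    for i in $(seq 1 20); do
        ./check_http -H www.example.com -u /large-page.html -R "some string" -t 60 &
    done
    wait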
OS: Red Hat Enterprise Linux 4
Nagios plugins version: 1.4.5
check_http version: 1.96
Any help would be appreciated.