Monitoring Benchmarks

joe1871 · March 28, 2011, 6:05pm

Hi

I am beginning the process of expanding our Nagios monitoring platform. I currently use Nagios to monitor approx. 650 hosts. Of these, the vast majority are simply monitored for up/down, i.e. ping checks. I do monitor a handful of machines for various services, disk space, port access, etc., but that is only a very small percentage of what I have used Nagios for up to this point. However, management is asking that we expand our monitoring to watch for several conditions on a much larger number of machines. These conditions are disk % used, CPU, memory used, and swap file usage.

It would help me a great deal to have some standard that management would be able to accept on how frequently I need to monitor all of these conditions. I understand that for the most part this is a subjective issue, depending on what the server does and what its level of use may be. However, I was hoping there might be some standard that you have either found or created that states how frequently these should be checked. This is obviously a balance between the capability of Nagios to execute these checks, the network and server activity that the checks will introduce, and the risk/benefit of more or less frequent checks. I imagine many of you have done similar exercises. Can you give me your guidelines or insight? Thanks

luca · March 29, 2011, 8:48pm

Personally I use 5 minutes, or 10, depending on the machine. But I run my Nagios setup mainly in a non-profit environment with no big money involved. If I had to check, for example, a data center core switch, I’d check it every minute, no more often than that, and only for a very few selected hosts.
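
To put rough numbers on the load side of that trade-off, here is a minimal Python sketch. It is only back-of-the-envelope arithmetic, not anything Nagios itself provides: the function name checks_per_second is made up for illustration, the 650 hosts come from the first post, and the 4 checks per host (disk, CPU, memory, swap) is an assumption. It also assumes checks are spread evenly over the interval, which a real scheduler only approximates.

```python
def checks_per_second(hosts, checks_per_host, interval_minutes):
    """Average number of checks that must be started each second,
    assuming they are spread evenly across the check interval."""
    return hosts * checks_per_host / (interval_minutes * 60)


# Assumed inputs: 650 hosts, 4 checks each (disk, CPU, memory, swap),
# compared at 1-, 5- and 10-minute intervals.
for interval in (1, 5, 10):
    rate = checks_per_second(650, 4, interval)
    print(f"{interval:>2} min interval: {rate:.2f} checks/sec")
```

With those assumed inputs, going from a 5-minute to a 1-minute interval multiplies the average check rate by five, which is why the short interval is probably best kept for a few selected hosts.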

I think that graphing the data is also very important for spotting trends, or at least for some after-the-fact assessment… The problem is that if there’s too much data nobody will look at it “proactively”, so decide beforehand what is useful data and what is not, and possibly create a separate webpage to show the important data.

Hope this helps somehow