I am beginning the process of expanding our Nagios monitoring platform. I currently use Nagios to monitor approximately 650 hosts. Of these, the vast majority are monitored only for up/down status, i.e. ping checks. I do monitor a handful of machines for various services, disk space, port access, etc., but that is only a very small percentage of what I have used Nagios for up to this point. However, management is asking that we expand our monitoring to watch for several conditions on a much larger number of machines. These conditions are disk percentage used, CPU usage, memory used, and swap file usage.
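To put rough numbers on what this means for Nagios, here is a back-of-the-envelope sketch in Python. It assumes the worst case where all 650 hosts get all four new checks; the candidate intervals (1, 5, and 15 minutes) are just illustrative values I picked, not a recommendation, and the function name is my own.

```python
# Rough estimate of the extra check load on the Nagios server.
# Assumptions (not measured): every one of the 650 hosts gets all
# four new checks (disk, CPU, memory, swap), spread evenly over time.

def checks_per_second(hosts: int, checks_per_host: int, interval_minutes: float) -> float:
    """Average number of checks Nagios must execute per second."""
    return hosts * checks_per_host / (interval_minutes * 60)

HOSTS = 650            # current host count from the description above
CHECKS_PER_HOST = 4    # disk, CPU, memory, swap

# Hypothetical candidate intervals, in minutes.
for interval in (1, 5, 15):
    rate = checks_per_second(HOSTS, CHECKS_PER_HOST, interval)
    print(f"every {interval:>2} min: {rate:.2f} checks/sec")
```

This only counts check executions. It says nothing about how long each check takes or how much load it puts on the monitored machine, which is part of what I am asking about.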
It would help me a great deal to have some standard that management would accept for how frequently I need to monitor each of these conditions. I understand that for the most part this is subjective, depending on what the server does and how heavily it is used. However, I was hoping there might be some standard, one you have either found or created, that states how frequently these should be checked. This is obviously a balance between the capacity of Nagios to execute the checks, the network and server activity that the checks will introduce, and the risk/benefit of more or less frequent checks. I imagine many of you have done similar exercises. Can you give me your guidelines or insight? Thanks