My company has really gotten into Nagios in a big way. We are rolling out service monitors for more than 750 hosts. I will end up with a minimum of 4 service checks per host, so I am looking at more than 2700 check processes on a 15-minute cycle.
I am running Nagios 3 in a Debian VM on XenServer. It is the only VM on a server with 8 cores and 72 GB of RAM. It is a RAID 5 box with six 15000 RPM drives. In other words, a reasonably good-sized Dell R710. I am rolling this out now and I am up to about 1200 services, with no apparent impact on the box. My average execution time is 4.496 seconds against 2700 active service checks in a 15-minute period.
Is there a metric I can watch on the server that will tell me how much headroom I have? If I fill up this server as planned, am I going to have performance problems? Would I be better off with two servers?
I had presented this project as requiring two servers to minimize traffic between cages in our DC - we have two locations with ~375 machines per cage. The cages are connected by a fiber link, but we already pass quite a bit of data across that link, and the majority of it is more important than monitoring data. I am leaning toward another server but need some metric-based arguments for management. Any help is greatly appreciated.
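For a rough sense of scale, the figures above can be turned into an average check rate and an average number of checks in flight. This is only a back-of-the-envelope sketch: it assumes the checks are spread evenly across the 15-minute cycle and that 4.496 seconds holds as the average execution time at full load, neither of which is guaranteed. The function name `check_load` is made up for illustration and is not part of Nagios.

```python
def check_load(num_checks, avg_exec_seconds, cycle_seconds):
    """Estimate average check rate and average concurrent checks.

    Assumes checks are evenly distributed over the cycle (an idealization).
    """
    # How many checks must start per second to finish the cycle on time.
    checks_per_second = num_checks / cycle_seconds
    # Little's law: average in flight = arrival rate * average duration.
    avg_concurrent = checks_per_second * avg_exec_seconds
    return checks_per_second, avg_concurrent


# Figures taken from the post: 2700 checks, 4.496 s average, 15-minute cycle.
rate, concurrent = check_load(2700, 4.496, 15 * 60)
print(f"{rate:.2f} checks/s, ~{concurrent:.1f} checks in flight on average")
```

Under those assumptions this works out to about 3 checks starting per second and roughly 13 to 14 check processes running at any moment. Most of that time is probably spent waiting on the network rather than on CPU, so the number by itself says little about headroom; it is just a baseline to compare against as more services are added.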