Distributed monitoring setup and performance

marac · September 4, 2009, 2:41pm

Hi guys,

I have a large setup with ~2100 nodes and ~19000 services. The monitoring is done by 1 central server and 3 “slaves”; each slave has roughly ~700 nodes and ~6000 services, though these are not exact numbers. All communication between the central server and the slaves goes through the NSCA plug-in, with passive checks enabled for services on the master. On the master, active checks are enabled only for host checks; on the slaves everything is in active mode. I use the NRPE client on all the nodes to do most of the work locally, except for the ping check.

We have been using Nagios 3.0.6 with the use_large_installation_tweaks parameter on, but I still see significant latencies. Checking the whole cluster, nodes only, takes 2 minutes and 30 seconds, which is sufficient. But then I measured the time for all the service checks: for ~7000 services it takes 7 minutes, for ~19000 around 9 minutes. I looked at the load of the master and the slaves, and they are just idle: the load is 0.05, 0.01, 0.04 (slave1, slave2, slave3). The next thing that seems strange to me is that the count of service checks goes as follows:

16:44:02 422 hosts 55 services
16:44:22 788 hosts 99 services
16:44:44 1298 hosts 182 services
16:45:19 1822 hosts 272 services
16:46:01 2082 hosts 431 services
16:47:01 2082 hosts 790 services
16:48:00 2082 hosts 1145 services
16:49:00 2082 hosts 1437 services
16:50:00 2082 hosts 5735 services
16:51:01 2082 hosts 6853 services
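
To make the jump easier to see, here is a minimal Python sketch that turns the samples above into a per-interval service rate. The helper names (`parse`, `service_rates`) are only for illustration and are not part of Nagios; the numbers are copied from the list above.

```python
from datetime import datetime

# Samples copied from the measurements above: time, hosts counted, services counted.
SAMPLES = """\
16:44:02 422 55
16:44:22 788 99
16:44:44 1298 182
16:45:19 1822 272
16:46:01 2082 431
16:47:01 2082 790
16:48:00 2082 1145
16:49:00 2082 1437
16:50:00 2082 5735
16:51:01 2082 6853
"""


def parse(text):
    """Parse 'HH:MM:SS hosts services' lines into (datetime, int, int) tuples."""
    rows = []
    for line in text.splitlines():
        stamp, hosts, services = line.split()
        rows.append((datetime.strptime(stamp, "%H:%M:%S"), int(hosts), int(services)))
    return rows


def service_rates(rows):
    """Return (interval_end, services_per_second) for each pair of consecutive samples."""
    rates = []
    for (t0, _, s0), (t1, _, s1) in zip(rows, rows[1:]):
        seconds = (t1 - t0).total_seconds()
        rates.append((t1.strftime("%H:%M:%S"), (s1 - s0) / seconds))
    return rates


if __name__ == "__main__":
    for end, rate in service_rates(parse(SAMPLES)):
        print(f"{end}  {rate:6.1f} services/s")
```

On these numbers the rate stays in the single digits per second up to 16:49:00, then jumps to roughly 72 services/s in the minute ending at 16:50:00.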

I haven’t done measurements with 19000 services, but it would be pretty much the same, just a bit longer.

When I checked the slaves, I could find only one process running the check_ and send_nsca commands. I tried various combinations, using the dumb instead of the smart method of checks for services and hosts, but with the dumb option it was worse.

My impression is that the bottleneck could be the send_nsca script and the slaves, because at the end the central server starts actively checking all the remaining services.

Do you have any recommendations? I went through all the articles, documentation, and books available for Nagios. Do I need more slaves?

Thanks,

Marek