Maximum monitored hosts and services?


#1

Hello everybody! I have a very serious performance problem right now :cry:
There are 5700 hosts and 7058 services in my Nagios. The server runs RHEL3 and Nagios 2.5, with 8 GB RAM and 8 CPUs.
All checks are active: pings (ICMP) and SQL (grep from a txt file). Host checks are disabled.

Performance is awful! It can't complete all the checks within an hour! When I had 3000 services the performance was great: all checks ran in 5 minutes. But our environment is growing fast, with many hosts and many services being added.

I can't understand WHY Nagios does almost nothing. When I watch the Nagios processes I see only 10 or 20 checks running! Why can't it run at least 50-100 checks simultaneously? Then the performance would be VERY GOOD!

So, I want to know the maximum number of services Nagios can check in 5 minutes. I have tuned Nagios using all the performance tips from the docs and the internet (forums and nagios-users). But now I see that Nagios performs badly with 7000 service checks. Maybe version 3.0 will solve my problems?
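As a rough sanity check on the numbers in this post, the sketch below works out the check rate needed to run 7058 services every 5 minutes, and how many checks would have to be in flight at once to sustain it. The average check durations are made-up values for illustration (nothing in the post states them), and the function names are hypothetical.

```python
def required_rate(services: int, interval_s: float) -> float:
    """Checks per second needed to run every service once per interval."""
    return services / interval_s


def required_concurrency(rate_per_s: float, avg_check_s: float) -> float:
    """Little's law: checks in flight = launch rate * average check duration."""
    return rate_per_s * avg_check_s


if __name__ == "__main__":
    rate = required_rate(7058, 5 * 60)
    print(f"required rate: {rate:.1f} checks/s")

    # ASSUMPTION: these average check durations are illustrative only.
    for avg_check_s in (0.5, 1.0, 2.0):
        in_flight = required_concurrency(rate, avg_check_s)
        print(f"avg check {avg_check_s:.1f} s -> about {in_flight:.0f} checks in flight")
```

If only 10 to 20 checks are ever running at once, the 5-minute target is reachable only when the average check finishes in well under a second, so measuring the real check duration would be a sensible first step.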

Can anybody tell me whether they monitor this many hosts? What is the configuration of your Nagios server, what kinds of checks are performed, and so on? I would like to see some statistics on Nagios performance.

Help me please! :frowning: I am in despair.

Kristine


#2

Hi!

Good news for you: I had “exactly” the same problem: our Nagios performance was going down due to too many tests…

We ran extensive performance tests to try to find the maximum number of services allowed.
We found that, on our server, this limit was between 2500 and 2900 services; above that, the latency would increase exponentially.
On a better server, this limit doesn't increase much…

The conclusion was that Nagios (the software itself) was not able to launch more than 5-6 service checks per second… that is likely to be your problem too.
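To show what a cap of 5-6 launches per second would mean in practice, here is a minimal sketch that converts a launch rate into the number of services that fit in one check cycle, and the reverse. The 5-minute cycle is an assumption borrowed from the first post, since the check interval on this server is not stated; the function names are hypothetical.

```python
def services_per_cycle(launch_rate_per_s: float, interval_s: float) -> float:
    """How many checks can be launched within one check interval."""
    return launch_rate_per_s * interval_s


def cycle_time_s(services: int, launch_rate_per_s: float) -> float:
    """How long it takes to launch every service check once."""
    return services / launch_rate_per_s


if __name__ == "__main__":
    interval_s = 5 * 60  # ASSUMPTION: a 5-minute check interval

    for rate in (5, 6):
        fit = services_per_cycle(rate, interval_s)
        print(f"{rate} checks/s -> {fit:.0f} services per 5-minute cycle")

    # Time to get through the reported limit at the reported launch rates.
    print(f"2500 services at 6/s: {cycle_time_s(2500, 6) / 60:.1f} min")
    print(f"2900 services at 5/s: {cycle_time_s(2900, 5) / 60:.1f} min")
```

Under these assumptions 5-6 launches per second covers 1500 to 1800 services per 5-minute cycle, so a limit of 2500 to 2900 services would correspond to a cycle of roughly 7 to 10 minutes.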

The solution?
Ours: we created 2 new accounts, nagios2 and nagios3. Under each of these accounts we installed a Nagios server, and we distributed all our tests across these new Nagios servers.
A nearly identical alternative was to run 2 new Nagios servers under the same account, sharing one Apache server… but that is quite tricky to implement and more prone to bugs, so the choice is yours.

However, you should be aware that in both cases, running multiple Nagios servers will increase your CPU consumption (and memory, but we've never had problems with that).

I hope this helps, and that you'll be authorized to implement this easy solution.
Don’t hesitate to ask other questions on this subject :slight_smile:

Btw: I tried asking here whether Nagios 3.0 would solve the problem by itself, but I got no answer… maybe you'll get lucky (I can't help you there: we're running Nagios 2.10, and this version has the same latency problems, if not worse).


#3

[blockquote]Hi!

Good news for you: I had “exactly” the same problem: our Nagios performance was going down due to too many tests…[/blockquote]

Thank you for your answer! I am glad that I am not alone. :)

[blockquote]We ran extensive performance tests to try to find the maximum number of services allowed.
We found that, on our server, this limit was between 2500 and 2900 services; above that, the latency would increase exponentially.
On a better server, this limit doesn't increase much…[/blockquote]

What do you mean by “better”? More powerful? What is the configuration of your server? I don't think it matters which server we use. On my Nagios server the processors are hardly loaded at all and there is 5 GB of free RAM!

[blockquote]The conclusion was that Nagios (the software itself) was not able to launch more than 5-6 service checks per second… that is likely to be your problem too.[/blockquote]

Mine launches almost 20…

[blockquote]The solution?
Ours: we created 2 new accounts, nagios2 and nagios3. Under each of these accounts we installed a Nagios server, and we distributed all our tests across these new Nagios servers.
A nearly identical alternative was to run 2 new Nagios servers under the same account, sharing one Apache server… but that is quite tricky to implement and more prone to bugs, so the choice is yours.[/blockquote]

I don't think this is very good, because it is difficult to maintain 2-3 servers with, for example, 2000 services each. If one dies… But if NOTHING else solves this problem, it will be the only solution in my situation.

[blockquote]Btw: I tried asking here whether Nagios 3.0 would solve the problem by itself, but I got no answer… maybe you'll get lucky (I can't help you there: we're running Nagios 2.10, and this version has the same latency problems, if not worse).[/blockquote]

I will wait in great hope that somebody helps us :cry:
If not, I will have to deploy other monitoring software to suit our large environment.
It will be very sad for me and other Nagios users if Nagios can't perform more than 3000 checks AT ALL in 5-10 minutes because of VERY LARGE latency, and if this is a limitation of the software that we, Nagios users and admins, can't change ourselves. :cry:


#4

Yes, that's what I meant :slight_smile:
Note: we're running Solaris 8 and Solaris 10 (no difference between these versions in this respect).

I'm quite surprised to read that.
It might mean that our conclusions are wrong; maybe this limit is OS or server dependent…

Well, in my case, our supervision covers different “domains”, so the separation onto 3 servers was logical (i.e. software supervision on the 1st server, server and database on the 2nd server, and network equipment on the 3rd). In the end, it even helped us: whenever we see an alert in Netcool (which we use as our top-level supervision console), we know which server the alert comes from and, because there are fewer services on it, we can find it more quickly :slight_smile:
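The split by domain described here can be sketched as a simple lookup from a service's domain to the Nagios instance that should run it. The instance names reuse the account names mentioned earlier in the thread, but which domain lands on which account, the domain labels, and the sample services are all assumptions made up for illustration.

```python
from collections import defaultdict

# ASSUMPTION: hypothetical mapping of supervision domains to Nagios instances.
DOMAIN_TO_INSTANCE = {
    "software": "nagios",
    "server": "nagios2",
    "database": "nagios2",
    "network": "nagios3",
}


def partition(services):
    """Group (service_name, domain) pairs by the instance that should check them."""
    groups = defaultdict(list)
    for name, domain in services:
        groups[DOMAIN_TO_INSTANCE[domain]].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample services, not taken from the thread.
    sample = [
        ("app_http", "software"),
        ("disk_usage", "server"),
        ("db_listener", "database"),
        ("switch_ping", "network"),
    ]
    for instance, names in sorted(partition(sample).items()):
        print(f"{instance}: {len(names)} services -> {names}")
```

Printing the per-instance counts makes it easy to see whether any one instance ends up above the service limit discussed earlier.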

Also… if one server dies, you'll know quickly, because it will be supervised by the 2 other servers (after all, who is better placed than the supervisor to know that something's wrong?).
More than that: if a server dies, you'll lose only 1/3 of your supervision… now imagine that your main server with 7k services dies ^^

In the end, I can only find advantages to this solution :slight_smile:


#5

[blockquote]Yes, that's what I meant :slight_smile:
Note: we're running Solaris 8 and Solaris 10 (no difference between these versions in this respect).[/blockquote]

I meant: what CPU and how much RAM does your server have? I wanted to say that if my 8 CPUs are not loaded, the problem is not in the hardware.

[blockquote]I'm quite surprised to read that.
It might mean that our conclusions are wrong; maybe this limit is OS or server dependent…[/blockquote]

Oh, that is really strange. I agree that the difference may be in our OS, but maybe there is something else… in our configs, for example.
Are these parameters in your nagios.cfg the same as on my server?

service_inter_check_delay_method=n
host_inter_check_delay_method=n
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=2
sleep_time=0.10
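
For comparing these settings between two servers, a small script can parse the `key=value` lines and report only the differences. This is a minimal sketch: `parse_cfg` and `diff_cfg` are hypothetical helper names, and the values in `OTHER` are invented purely to show the output, not anyone's real configuration. It keeps only the last value for a repeated key, which is fine for the six tuning parameters above but not for directives that may legitimately appear several times in nagios.cfg.

```python
MINE = """\
service_inter_check_delay_method=n
host_inter_check_delay_method=n
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=2
sleep_time=0.10
"""

# ASSUMPTION: invented values for a second server, for illustration only.
OTHER = """\
service_inter_check_delay_method=s
host_inter_check_delay_method=n
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=10
sleep_time=0.25
"""


def parse_cfg(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def diff_cfg(mine, theirs):
    """Return {key: (my_value, their_value)} for every setting that differs."""
    keys = sorted(set(mine) | set(theirs))
    return {k: (mine.get(k), theirs.get(k)) for k in keys if mine.get(k) != theirs.get(k)}


if __name__ == "__main__":
    for key, (a, b) in diff_cfg(parse_cfg(MINE), parse_cfg(OTHER)).items():
        print(f"{key}: mine={a} theirs={b}")
```

To use it on real files, the two strings could be replaced by the contents of each server's nagios.cfg read from disk.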