Performance issue on host pings


#1

I am having performance issues on our Nagios 3.0.2 box. I have 340 hosts and 515 services being monitored. I noticed this past weekend that Nagios is not paging out and is not catching server reboots. I moved all of our routers and switches off this Nagios server, which dropped the number of hosts to 225. I am seeing some improvement, but not much. We are still not getting alerts on VM server reboots.
Here is my process info:

Hosts Actively Checked:
Time Frame             Hosts Checked
<= 1 minute:           132 (58.7%)
<= 5 minutes:          214 (95.1%)
<= 15 minutes:         225 (100.0%)
<= 1 hour:             225 (100.0%)
Since program start:   225 (100.0%)

Services Actively Checked:
Time Frame             Services Checked
<= 1 minute:           169 (32.8%)
<= 5 minutes:          516 (100.0%)
<= 15 minutes:         516 (100.0%)
<= 1 hour:             516 (100.0%)
Since program start:   516 (100.0%)

We need to be alerted on servers rebooting.

I have changed my settings to the following:

Your config is set for

check_interval 1
retry_interval 1
max_check_attempts 3

But this isn’t helping. If anyone can offer some suggestions I would appreciate it. I do not want to have to set up another server.


#2

I think those numbers from the performance info are OK. Mine are even lower than that, and I have more hosts and more services than you do. Nagios still reports every unreachable host and every failing service.

Have you checked nagios.log or the Event Log in the web GUI to see whether Nagios has checked those VM servers at all? How long does it take them to boot up and become reachable again? If that is under 3 minutes, it is normal for Nagios not to report the reboot: you’ve set max_check_attempts to 3, and with check_interval and retry_interval both at 1 that adds up to about 3 minutes before a hard state is reached (assuming interval_length in nagios.cfg is set to 60 seconds). If that is not the cause, maybe you shouldn’t check every host and every service every minute. I’m pretty sure some things don’t need to be checked that often, and relaxing those would let Nagios spread the scheduled checks over more time and finish them all within their intervals.
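To make the 3 minute figure concrete, here is a rough Python sketch of the timing. It is a simplified model, not how Nagios schedules internally: it ignores check latency and plugin execution time, and the function names are made up for illustration.

```python
def seconds_to_hard_state(max_check_attempts, retry_interval, interval_length=60):
    """Seconds from the first failed check until the state goes HARD.

    Simplified model: the first failed check counts as attempt 1, and each
    remaining attempt is scheduled one retry_interval later.
    """
    retries = max_check_attempts - 1
    return retries * retry_interval * interval_length


def worst_case_detection(max_check_attempts, check_interval, retry_interval,
                         interval_length=60):
    """Worst case seconds from the host going down until a HARD state.

    Assumes the host went down right after a successful check, so up to one
    full check_interval passes before the first failed check.
    """
    wait_for_first_failure = check_interval * interval_length
    return wait_for_first_failure + seconds_to_hard_state(
        max_check_attempts, retry_interval, interval_length
    )


# Values from the first post: check_interval 1, retry_interval 1, max_check_attempts 3
window = worst_case_detection(max_check_attempts=3, check_interval=1, retry_interval=1)
print(window / 60, "minutes")
```

Under this model, a VM that is back up before the last retry fires would never reach a hard state, so no notification goes out.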
Check the log for those VM servers and see if they end up in a hard state in any check.
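If the log is large, a small script can pull out the host alerts for one server. This is only a sketch: it assumes the usual "[timestamp] HOST ALERT: host;state;SOFT|HARD;attempt;output" line layout in nagios.log, and the host name and sample lines below are invented for illustration.

```python
import re

# Layout assumed: [epoch] HOST ALERT: host;STATE;SOFT|HARD;attempt;plugin output
HOST_ALERT = re.compile(
    r"^\[(\d+)\] HOST ALERT: ([^;]+);([^;]+);(SOFT|HARD);(\d+);(.*)$"
)


def host_alerts(lines, host):
    """Return (timestamp, state, state_type, attempt) tuples for one host."""
    found = []
    for line in lines:
        match = HOST_ALERT.match(line.strip())
        if match and match.group(2) == host:
            ts, _, state, state_type, attempt, _ = match.groups()
            found.append((int(ts), state, state_type, int(attempt)))
    return found


def reached_hard_down(lines, host):
    """True if the host ever logged a DOWN alert in a HARD state."""
    return any(
        state == "DOWN" and state_type == "HARD"
        for _, state, state_type, _ in host_alerts(lines, host)
    )


# Hypothetical log excerpt: the host recovers before the third attempt fails.
sample = [
    "[1220000000] HOST ALERT: vmserver01;DOWN;SOFT;1;CRITICAL - Host Unreachable",
    "[1220000060] HOST ALERT: vmserver01;DOWN;SOFT;2;CRITICAL - Host Unreachable",
    "[1220000120] HOST ALERT: vmserver01;UP;SOFT;3;PING OK",
    "[1220000200] SERVICE ALERT: vmserver01;PING;OK;HARD;1;PING OK",
]

for alert in host_alerts(sample, "vmserver01"):
    print(alert)
print("hard DOWN seen:", reached_hard_down(sample, "vmserver01"))
```

To run it against the real log, pass the opened nagios.log file object in place of the sample list. If only SOFT entries show up for a VM, the reboot finished before Nagios used up its check attempts.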

What is sleep_time set to in nagios.cfg?
Let’s say it is set to 0.25, that is, 1/4 of a second. Mathematically, for your 515 services it would take 515 × 0.25 = 128.75 seconds to get through all of them, which is a bit more than 2 minutes. In that case not all services could be checked within one minute.
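The same arithmetic as a quick Python sketch. It follows the simplified assumption made above, that one check gets started per sleep_time tick; real Nagios scheduling is more involved, and the function names are just for illustration.

```python
def pass_duration_seconds(num_checks, sleep_time):
    """Seconds to start every check once, assuming one check per sleep_time tick."""
    return num_checks * sleep_time


def checks_per_window(window_seconds, sleep_time):
    """How many checks fit in a window under the same assumption."""
    return int(window_seconds / sleep_time)


services = 515      # from the first post
sleep_time = 0.25   # example value, not necessarily what is configured

duration = pass_duration_seconds(services, sleep_time)
print(duration, "seconds for one full pass")
print(checks_per_window(60, sleep_time), "checks fit into one minute")
```

With these numbers a full pass takes longer than the 60 second check interval, which is why lowering sleep_time or raising check_interval for less critical services is worth trying.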

Also, enable these variables in nagios.cfg if you often restart/reload Nagios during these tests:
retain_state_information=1
use_retained_scheduling_info=1

Also set the other associated options correctly. These will help retain previous states and lower the number of initial checks after restarting Nagios.