Nagios doesn't update statuses during multiple failures


#1

Hello all,

I’ve got a strange problem with the new Nagios 2.5 and a big failure of hosts…
I’m monitoring about 400 services and about 200 hosts, and
a few days ago I had a big failure (about 100 services). The problem is
that the statuses were updated very slowly… after 30 minutes
I had only about 15 notifications with critical status.
Of course every service is checked every 5 minutes, but during the
failure I could see, on a service I knew was not working,
something like this:

Next Scheduled Active Check: 11-21-2006 09:42:31
Latency: 252.565 seconds

and it was already 09:56 at the time… why wasn’t this service
checked? Or more precisely - why didn’t Nagios mark it
as “critical”, because I believe it actually was checked.

Now I’m trying to reproduce the problem with iptables…
I have Nagios 2.5, and I cut off part of the network with

iptables -I OUTPUT -d dest_network/24 -j DROP

which cuts off about 100 hosts (and 100 PING services).
After 25 minutes only 10 hosts had been recognized as down, and
the situation with “next scheduled active check” occurred again.
Of course no DNS server, gateway, etc. is in this network.

The very strange thing is that when I remove the blocking with

iptables -D OUTPUT -d dest_network/24 -j DROP

suddenly 70 new critical statuses appear on the Nagios www page…
Why is that? If I disable a few (2-5) services - everything
works fine, I get the information quickly. But if I disable many
(tens, hundreds) of services - everything is very, very slow.

Below is my nagios.cfg (not all of it - I left out the irrelevant lines):

accept_passive_host_checks=1
accept_passive_service_checks=1
admin_email=nagios
admin_pager=pagenagios
aggregate_status_updates=1
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
check_external_commands=1
check_for_orphaned_services=0
check_host_freshness=0
check_service_freshness=1
command_check_interval=-1
daemon_dumps_core=0
date_format=us
enable_event_handlers=1
enable_flap_detection=0
enable_notifications=1
event_broker_options=-1
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
high_host_flap_threshold=20.0
high_service_flap_threshold=20.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=~$&|'"<>
illegal_object_name_chars=~!$%^&*|'"<>?,()=
interval_length=60
log_event_handlers=1
log_external_commands=1
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=1
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=5.0
low_service_flap_threshold=5.0
max_concurrent_checks=0
max_host_check_spread=30
max_service_check_spread=30
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
obsess_over_services=0
ocsp_timeout=5
perfdata_timeout=5
process_performance_data=0
retain_state_information=1
retention_update_interval=60
service_check_timeout=60
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
service_reaper_frequency=10
sleep_time=0.25
status_update_interval=15
use_aggressive_host_checking=0
use_regexp_matching=0
use_retained_program_state=0
use_retained_scheduling_info=0
use_syslog=1
use_true_regexp_matching=0

I would be grateful for any help…

regards,

Sierp


#2

OK, I know now why I had the problem…
I’m using a PING service for every host. That works
fine, but additionally I’m using check_host_alive for
every host… When I switched it off - everything worked as fast
as I wanted it to.
I don’t know why, but checking whether a host is alive is not
parallelized - so for each down host I have to wait 100 seconds
(10 x 10 s of timeout) for the execution before the next one starts.
I still don’t know why the PING service wasn’t critical while Nagios
additionally ran check_host_alive.

The main question - is there any reason to stay with the
check_host_alive configuration? The only obvious drawback
I can see is losing the host-down notifications,
but I think I can work around that with notifications on the PING service.

The other option is to tune check_host_alive a little, e.g. set the timeout to 2 s and change the
number of checks to 2 (I don’t know how yet, but I hope I will find out).
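
Something like this is what I have in mind - only a sketch, assuming check_host_alive is built on the stock check_ping plugin the way the sample configs do it:

# send only 2 packets and give up after a 2 second timeout
define command{
        command_name    check_host_alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 2 -t 2
        }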

TIA for suggestions

Sierp


#3

"and
few days ago I had big failure (about 100 services). The problem is,
that the statuses were updated very slowly… after 30 minutes"
Well, not really. The problem is that you most likely don’t have your hosts parent function setup properly.

Did you really have 100 hosts down, or 100 services?
Anyway, if you have Nagios set up correctly then you should have had only ONE host down, and the rest of them would have been simply “unreachable”.

For example:
I have Nagios set up to check 100 hosts, and each one runs an httpd service.
So, I unplug the cable from the Nagios PC, and what do I get? I get 100 host-down emails.
Now, I decide to do the Nagios install the way it should be done, and what do I get? I get a single email showing the nagios-eth0 host is down. That tells me that the cable is unplugged from the eth0 interface on my Nagios box.
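
To sketch what “set up properly” means here (hypothetical host names; switch1 stands for whatever device sits between the Nagios box and the hosts behind it):

define host{
        use             generic-host
        host_name       switch1
        address         10.0.0.2
        }

define host{
        use             generic-host
        host_name       printer42
        address         10.0.1.42
        parents         switch1
        }

With the parents directive in place, when switch1 goes down Nagios marks printer42 as UNREACHABLE rather than DOWN, so you can get a single DOWN notification for the switch and filter out the UNREACHABLE ones via the hosts’ notification_options.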


#4

So, your situation is normal.
A service check fails, so Nagios runs the host check (and stops doing ANYTHING else - no multitasking at all, it just stops). The host check fails, so it checks its parent. No parent? Then the host is marked down. Then it finds another service down, and so on…
It’s normal for Nagios to slow down when you have a huge number of poorly defined hosts and all of them are failing their host checks. If those hosts were not really down, but just unreachable, then Nagios would run just fine.


#5

Could you please give me a clue how that should be configured correctly?
I really did have 100 hosts down (even more, actually), but, to be honest, that was not all of the hosts
which are monitored. I’m monitoring about 200 hosts. These 100 hosts are printers
and similar devices; the other 100 are servers/services.
I had a power failure, these 100 hosts (printers) went down, and I had a problem - each
printer was checked one by one, and I didn’t get any information about the servers,
because I was waiting for the printers to get recognized as down.

Now I’ve changed the configuration like this:

  • I don’t run check_host_alive on the printers - it’s pointless as long as the printers are monitored
    by the PING service anyway (see the sketch after this list)
  • I changed the timeout (from 10 to 3 seconds) and the retries (from 10 to 2) for check_host_alive on the servers

Now it’s working much better, but maybe there is a better solution?
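
This is roughly what the printer side looks like now - a sketch with hypothetical names, assuming a generic-host template that supplies the remaining required directives and does not set a check_command of its own:

# no check_command on the host, so Nagios never runs an active host check for it
define host{
        use             generic-host
        host_name       printer42
        address         10.0.1.42
        }

# the printer is still covered by its PING service
define service{
        use                     generic-service
        host_name               printer42
        service_description     PING
        check_command           check_ping!200.0,20%!600.0,60%
        }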

#6

OK, yes, there are times like that when Nagios can get really screwy when you actually do have 100 hosts down.
So yes, you are on the right track. The solution is to make your host check as quick as possible.
Here is most likely the best method.
First, get and install check_icmp:
oss.op5.se/nagios/
Run it and read what it says.

./check_icmp

If this program is invoked as check_host (with a symlink, preferably), it will exit with status OK upon the first properly received ICMP_ECHOREPLY, making it ideal for hostchecks (less network load and much, much faster on OK)
So, as it says, make a symbolic link. But first change the ownership and permissions on the check_icmp file so it looks like this:

-r-sr-x---  1 root  nagios  38717 Nov 16 10:32 check_icmp
lrwxrwxrwx  1 root  root       10 Nov 16 11:59 check_host -> check_icmp
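
One way to get it looking like that - just a sketch, assuming the plugins live in /usr/local/nagios/libexec and your Nagios user’s group is called nagios:

cd /usr/local/nagios/libexec
chown root:nagios check_icmp     # owned by root, executable by the nagios group
chmod 4550 check_icmp            # setuid root, since check_icmp needs raw ICMP sockets
ln -s check_icmp check_host      # invoked via this name it exits OK on the first reply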
Define a command called check_host as instructed by ./check_icmp -h:

define command{
        command_name    check_host
        command_line    $USER1$/check_host -H $HOSTADDRESS$
        }
Define your host check to use that command.

define host{
        name            generic-host
        check_command   check_host
        ....
Now, when you have a host failure, the host check is going to run very quickly, and should not slow you down much.