I seem to have a little problem with Nagios when one of the hosts I monitor goes down.
Context:
1 Nagios server
8 proxies being monitored, with a total of 79 service checks distributed over the proxies.
All are OpenBSD 3.8 systems.
The scripts are all custom (I wrote them myself). I use ssh with PKI authentication to transfer the scripts from the Nagios server to the Nagios clients, execute them, get the results back, and then remove the scripts from the clients.
If a service goes down, Nagios reports that very quickly (for testing purposes I have set every normal_check_interval to 1 minute). But a problem occurs when a host goes down.
I have read up on this, and I am currently using check_icmp for both the check-host and check-ping purposes. It’s much faster than the regular check_ping binary.
But Nagios still seems to struggle when hosts go down. Does anyone know how Nagios deals with down hosts now? Are all service checks on the host that seems to be down put on hold? That sounds good, but I would want the service checks on ‘healthy’ hosts to continue… Is there a way to achieve this?
Are other users experiencing this kind of problem? And if so, what are your solutions?
Hosts that are still healthy WILL get checked, but only after Nagios has determined whether the host that service-A has failed on is up or down. So it’s important to make that “is host up or down” check very quick. It should NOT have retries over 1, and it should be a very quick check, like check_host via the check_icmp command. After the host is found to be down, all other service checks will continue normally.
Plus, the service that does fail will get rescheduled as well. So you don’t want to spend a huge amount of time trying that service over and over again. Let it fail after 1 retry; 5 retries is too high for my liking.
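To make the advice above concrete, here is a minimal sketch of what such a setup might look like in the object configuration. The host name, address, templates, service name, service check command, and the check_icmp thresholds are all assumptions made up for illustration, and I am reading "retries" as the max_check_attempts directive, so adjust to your own setup.

```
# Sketch only: names, address, templates and thresholds are hypothetical.
define command{
        command_name    check-host-alive
        # -n 3 sends three packets; thresholds are round-trip ms, packet loss %
        command_line    $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -n 3
        }

define host{
        use                     generic-host        ; hypothetical template with the other required directives
        host_name               proxy1              ; hypothetical host
        address                 192.0.2.11          ; documentation-range address
        check_command           check-host-alive
        max_check_attempts      1                   ; a single quick check, no retries
        }

define service{
        use                     generic-service     ; hypothetical template
        host_name               proxy1
        service_description     Proxy Health        ; hypothetical service
        check_command           check_proxy_by_ssh  ; hypothetical custom check
        normal_check_interval   1                   ; as in the original post
        retry_check_interval    1
        max_check_attempts      2                   ; first check plus 1 retry
        }
```

With values like these, a dead host should be declared down after one short ICMP check, and a failing service goes to a hard state after a single retry instead of five.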
I agree with you that the host check should be very quick. For that reason I implemented check_icmp; it’s a matter of seconds then. But it still seems to check all services for that down host too, and I don’t want it to do that because it takes a lot of time. Is it possible to disable those service checks when it’s known that the host is down? Some sort of host-service dependency…
Yes, but not automatically. I, for one, need to know when the service check is good again. I’m not going to disable any checks. The idea is to fix the box, not to fix Nagios. Disable the checks manually if you wish. That is what I do when I know a box is going to be down for a long time. It saves a lot of CPU cycles that way.
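For reference, the closest thing to the host-service dependency asked about is probably a servicedependency object. The sketch below assumes a hypothetical PING service (using check_icmp) already exists on the host and makes another, equally hypothetical, service depend on it. It is an illustration under those assumptions, not a tested configuration.

```
# Sketch only: host and service names are hypothetical.
define servicedependency{
        host_name                       proxy1          ; host carrying the master service
        service_description             PING            ; assumed existing check_icmp service
        dependent_host_name             proxy1
        dependent_service_description   Proxy Health    ; service to hold back
        execution_failure_criteria      c,u             ; skip active checks while PING is CRITICAL or UNKNOWN
        notification_failure_criteria   c,u             ; and suppress its notifications
        }
```

The trade-off is the one raised in the reply above: while PING is failing, the dependent service is not actively checked, so its recovery only shows up once the PING service is healthy again.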