I’m trying to reduce the time Nagios takes to “converge” on a problem host.
I’m using the “execution_failure_criteria” feature (in both the service dependency definition and the host dependency definition) to prevent service/host checks occurring on hosts whose parent host is down. I’m doing this so that Nagios doesn’t check services/hosts that would not be available anyway.
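As a rough illustration of this kind of setup, here is a minimal sketch of the two definitions. The host names (core-router, customer-switch) and the PING service are placeholders made up for the example, not taken from a real config, and the failure criteria shown are just one plausible choice.

```
# Hypothetical names: "core-router" is the upstream host,
# "customer-switch" sits behind it.
define hostdependency{
        host_name                       core-router
        dependent_host_name             customer-switch
        execution_failure_criteria      d,u     ; skip checks of the dependent host while the master is DOWN or UNREACHABLE
        notification_failure_criteria   d,u     ; and suppress its notifications in the same states
        }

define servicedependency{
        host_name                       core-router
        service_description             PING    ; assumed service name
        dependent_host_name             customer-switch
        dependent_service_description   PING
        execution_failure_criteria      c,u     ; skip the dependent check while the master service is CRITICAL or UNKNOWN
        notification_failure_criteria   c,u
        }
```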
This is working great, in that it correctly prevents unnecessary service checks from taking place and the problem host is identified more quickly.
However, because the check never takes place, the child services (and hosts) are always marked “OK”. It would be nice to be able to mark these as “unreachable” as soon as the parent host becomes unavailable, as indeed does happen - eventually - when all the child services/hosts are checked.
Anyone else doing or trying this kind of thing? (Running Nagios 2.0b3 by the way).
Thanks in advance.
I messed around with service/host dependencies. I monitor all the switches/routers in the company network and, as you know, one usually depends on another from Nagios’ point of view.
But when I unplugged my network cable, it took a very long time for Nagios to walk the dependency tree and find out that the problem host was eth0 on localhost.
So what I settled on was using the parent/child relationship that you give each host in the hosts.cfg file. Build that correctly and you will have only one host down, and the rest will just be unreachable. Sure, the checks may still occur, but at least you can then set up your notifications for only “down” and “recovered”, and not for “unreachable”.
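A minimal sketch of that approach, assuming a generic-host template that supplies the remaining required directives. The host names and addresses are made-up placeholders.

```
define host{
        use                     generic-host    ; assumed template
        host_name               core-router     ; hypothetical upstream host
        address                 192.0.2.1
        notification_options    d,r
        }

define host{
        use                     generic-host
        host_name               customer-switch ; hypothetical host behind core-router
        address                 192.0.2.2
        parents                 core-router     ; parent/child relationship used for reachability
        notification_options    d,r             ; notify on DOWN and recovery only, not UNREACHABLE (u)
        }
```

With a definition like this, if core-router fails its check, customer-switch should end up UNREACHABLE rather than DOWN, and leaving out the u option keeps it from generating notifications.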
We’ve got some very deep levels of dependencies here - up to 10 deep (we’re an ISP). I have all the parent/child relationships set up, plus dependencies for everything. (Not monitoring individual switch ports as well yet, though!)
I found that walking the dependency tree took too long, partly because of the depth involved and partly due to the number of checks I was doing. It’s now not too bad: the problem host gets identified fairly quickly.
The problem is, dependent hosts are NOT getting marked Unreachable - they stay in the OK state because once their parent is down no further checks are made. This is the effect of the “execution_failure_criteria” option, which I think was introduced with Nagios v2.
I don’t want to have Nagios waste time checking hosts that should be known to be unreachable, but I would like those hosts to be marked as unreachable on the status map.
I may just have to have a try at modifying the code for this one!
Yes, you are correct, it won’t bother to check the dependent hosts. But why would that be a problem? You surely don’t want to be notified about those other hosts, since they are most likely fine. You don’t want them to clutter up your status.cgi screen either. So if you can live with the amount of time it takes to walk the tree, I think it’s great that they are not marked as unreachable or even checked.
But if you truly need them to be marked unreachable, then just use the parent/child relationship in the hosts.cfg file, and ignore the service/host dependency.cfg files. That’s what I do, since I didn’t like the amount of time spent finding the problem host. And yes, all my hosts that are marked “unreachable” are still being checked every 5 minutes, but I haven’t found that to be a problem.
It’s not really a problem for me, but others (for example bosses and receptionists) are not savvy enough to work out that there is an upstream problem.
Anyway, I’ve removed the dependency checking so I do have it checking all the services once more. It’s not ideal but I can live with that. I also noticed that the roadmap for v3 of Nagios includes improved checking for downstream hosts so things can only get better.
Many thanks for your input.