What does check_command in host def. do?


#1

Just wondering what this setting is supposed to do as the sample configs give this a value of ‘check-host-alive’ but it never runs.

I was pinging 35 hosts and all worked well then I noticed that ‘check-host-alive’, as used in the host definitions sample, does just that.

Now I’ve configured it so that I’m just checking the DNS service on 3 machines, and they are the only ones that show a non-pending status, and they are only ‘assumed’ to be up.

Am I missing something? I don’t get why I need to explicitly define a ping service for a switch/router so that Nagios will perform the ping function configured in the host definition.

I know I’m probably missing something but the docs aren’t too clear on a number of things.

Using 1.2 on OS X btw


#2

Yeah, I spent a week trying to figure that one out. Host checks in v1.2 and earlier never occur unless a service check fails. The logic is this: why bother to ping a device with the “check-host-alive” check if a service on that host returns an OK state? If DNS is running, then surely it will ping, right? To get the host to not show “(Host assumed to be up)”, you can create a “dummy” service that you know will fail; when that check fails, Nagios will run “check-host-alive” (i.e. ping the host) to see if the box is dead.
Bottom line is this: leave it showing “Host assumed…”, since it actually makes a lot of sense, doesn’t it? The host HAS to be up if a DNS check returns OK.
hahahaha, I love that problem, since it bugs everyone, including me. 2 weeks of wondering shot to hell, for me…
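
For anyone who finds it easier to read as code, here is a rough Python sketch of that logic. It is only a toy model of the behaviour described in this thread, not anything taken from the Nagios source; the function name `process_service_result` and the state labels are made up for illustration.

```python
from typing import Callable, Optional


def process_service_result(service_ok: bool,
                           host_check: Optional[Callable[[], bool]]) -> str:
    """Return the host state implied by one service check result.

    Hypothetical model of the on-demand host check described above:
    host_check stands in for the host's check_command (e.g. a ping),
    and None means no check_command was defined for the host.
    """
    if service_ok:
        # A service answered, so the host check is never run.
        return "UP (assumed)"
    if host_check is None:
        # No check_command defined: the host is always assumed to be up.
        return "UP (assumed)"
    # Only a failed service check triggers the host's check_command.
    return "UP (checked)" if host_check() else "DOWN"


if __name__ == "__main__":
    dead_host = lambda: False  # pretend the ping gets no reply
    print(process_service_result(True, dead_host))
    print(process_service_result(False, dead_host))
```

The "dummy" service trick corresponds to always passing `service_ok=False`, which is the only way to make this model run the host check.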


#3

interesting info :slight_smile: would be nice if the check could be forced…
well thinking about it… defining the host with only the check host alive, and only after the first check should give it a hard state…
I’ll see soon enough: next week I’ll have to put Nagios 2 online, so I’ll run some tests :slight_smile:

Luca


#4

Thanks for the confirmation Jakkedup. After several more RTFMs I figured out the logic, so to minimise checks I removed ‘check-host-alive’ from all the hosts where I just wanted to know whether they were pingable, and gave each one a ping service instead.

My logic being, if ping fails then ping will fail :wink:

Does this logic change for v2?


#5

I think in version 2 host checks can be done on a regular 5 minute interval, or whatever you set. But why did you remove ‘check-host-alive’ from all hosts? That’s a big mistake. For HOSTA, which has 3 service checks done on it, if a service check fails then we want to know if the host is dead too. So please don’t REMOVE the host check. It’s important. If ftp is NOT running, you will then know whether the host is dead or not, and can fix it remotely, or whatever. I see that the service you are checking is “ping” and that the host check is the same thing. But in the future, as you start to progress, there will be tons of services that you will be checking, and you may want to define a host “check_command”.
This is from the docs:
"check_command: …If you leave this argument blank, the host will not be checked - Nagios will always assume the host is up. This is useful if you are monitoring printers or other devices that are frequently turned off."
In certain cases, not defining this may give you a headache. Example: switch1 connects to switch2 (so switch2’s parent host is switch1 in the hosts.cfg file). If I define service checks for both switch1 and switch2 but don’t define HOST check commands, then if switch1 dies I will get a page/email from Nagios about both switch1 and switch2 being dead. But in fact switch2 would be “unreachable”, and I don’t get “unreachable” emails. Nagios could not tell that switch2 was unreachable, since it is assuming hosts are up. The host is NOT up, but simply because you/I did not define a “check-host-alive” check_command, it is ALWAYS assumed up. In other words, please define a host check command for all hosts. Otherwise it will give you headaches later on, when you have 298 hosts and 1071 services and are getting 200 emails or pages just for ONE host being down and 199 hosts being unreachable (which you don’t want). You want ONE email telling you what is broken, and that’s all.
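
To make the switch1/switch2 example concrete, here is a rough Python sketch of how the parent relationship and the host check combine. It is a simplified model (one parent per host, made-up function and variable names), not how Nagios is actually implemented, and it assumes a host without a check_command is always treated as up, as the docs quote says.

```python
from typing import Dict, Optional, Set


def host_state(host: str,
               parents: Dict[str, Optional[str]],
               responds: Dict[str, bool],
               has_check: Set[str]) -> str:
    """Classify a host as UP, DOWN or UNREACHABLE (simplified model).

    parents   -- hypothetical map of host name to its single parent (or None)
    responds  -- whether the host would answer its check_command right now
    has_check -- names of hosts that have a check_command defined
    """
    if host not in has_check:
        # No check_command: the host is never checked and is assumed up.
        return "UP"
    if responds[host]:
        return "UP"
    parent = parents.get(host)
    if parent is not None and host_state(parent, parents, responds, has_check) != "UP":
        # The host is silent, but so is the device in front of it.
        return "UNREACHABLE"
    return "DOWN"


if __name__ == "__main__":
    parents = {"switch1": None, "switch2": "switch1"}
    responds = {"switch1": False, "switch2": False}  # switch1 has died
    for name in ("switch1", "switch2"):
        with_checks = host_state(name, parents, responds, {"switch1", "switch2"})
        without_checks = host_state(name, parents, responds, set())
        print(name, with_checks, without_checks)
```

With check commands on both switches, this model reports switch1 as DOWN and switch2 as UNREACHABLE. With none defined, both come back as UP, which is why the service alerts for both switches would go out.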

You can force the check by disabling/enabling the host checks by using the external command link “Disable checks of this host” and “Enable checks of this host”.


#6

Thanks for pointing out my inability to RTFM :wink:
I now understand the need for a dummy service.

Though if you had read my reply then you would have realised that I wasn’t ignoring host checks for devices with ‘real’ services on them. :stuck_out_tongue:

This whole host checking logic is backward: why bother checking services if the host is down? Maybe there is an argument for saving bytes on a distributed network, but to my mind ping is the lowest-resource check that you are likely to do. If v2 reverses this logic then roll on a stable release.

#7

uhm… i think services are NOT checked if the check_host_alive fails… but check_host_alive is only done when a service fails (or if there are no services to check).
BTW i tend to use a check_ping anyway on all machines to see the network state… ping return times (or losses) can sometimes be quite helpful.

Luca


#8

It appears that you have misunderstood how Nagios works in v1.2. It checks services only. If a service fails, it will then check to see if the box is dead. Both of these are important for the user, since he will now know that not only is the ftp daemon dead, the box won’t ping either. So he is going to have to go to the box and find out why it won’t ping. Otherwise, he could just telnet to the device and restart the ftp daemon (or whatever service). By using host dependencies, hosts will not be checked if a host that they are dependent on is not alive.

I find host/service dependencies to be very confusing, and I did configure them once, with several devices. The problem I found was that if I unplugged the net cable to the Nagios PC, then of course every service check would fail. It took Nagios far too long to walk the dependency tree to finally give me the results of “what service is broken and what is just ‘unreachable’”.