Hi all,
I’ve just installed Nagios 2.0 and I have to monitor a lot of routers (several hundred). Initially I’ll monitor them only via ICMP (reachability). Do I have to create a PING service per host (long and boring work…), or is there another way?
Rumor has it (I say rumor because I’ve never tested this myself, but others have said it works) that you can create a hostgroup for the machines you want to ping. Then, in your service definition, you put hostgroup_name instead of host_name.
Alternatively, you could use host_name and have a comma-delimited list of the routers you want to ping.
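For illustration, the hostgroup approach might look roughly like the sketch below. The hostgroup name, the router host names, and the generic-service template are placeholders made up for this example, and the check_ping thresholds are arbitrary. It also assumes a check_ping command taking warning and critical arguments is already defined, as in the sample configuration.

```
# hostgroups.cfg (all names here are hypothetical)
define hostgroup{
        hostgroup_name  routers
        alias           All routers
        members         router1,router2,router3
        }

# services.cfg: one definition applied to every member of the hostgroup
define service{
        use                     generic-service   ; assumed template supplying the other required directives
        hostgroup_name          routers
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%   ; example thresholds only
        }
```

The comma-delimited variant would replace the hostgroup_name line with something like `host_name router1,router2,router3`.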
Anyway, ping is the default method for checking that a host is alive, so simply defining the host in the hosts.cfg file will have Nagios running pings on it. It’s not really necessary to run more ping checks (unless I’m missing something or am horribly mistaken). You just wouldn’t see them in the “Service Overview” page of the web interface that way.
Personally, I think it’s a mistake to take the hostgroup or comma-delimited shortcut. For each host/router you should have a parent host defined in the hosts.cfg file. This will give you a good idea of which router has failed. Otherwise, unreachable routers may be reported as “Down” when in fact they are simply “unreachable” because another router in the path has failed.
The other thing is, for each and every router, you should be graphing the output of “ping” with nagiostat. This may not be important right now, but in the future it may help you find trouble on the network. In my case, I’ve noticed that since Sunday at 6 PM there has been an unexplained increase in RTAs for most of our network. Without the graphs, nobody would even know that our RTAs have increased from 50 ms max to 200 ms max.
Also, for each and every router there can be many other checks that are very helpful. For example: power supplies 1 and 2 up, RPMs for fans 1, 2, and 3, and others.
If you take the hostgroup route, let us know how it works out. We are dying to hear whether this actually works, since it’s not documented and others have said that yes, it does work.
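As a rough sketch of the parent relationship described above, assuming a generic-host template and two made-up routers (the names and the documentation-range addresses are placeholders):

```
# hosts.cfg (hypothetical hosts)
define host{
        use             generic-host   ; assumed template supplying the other required directives
        host_name       core-router
        alias           Core router
        address         192.0.2.1
        }

define host{
        use             generic-host
        host_name       branch-router
        alias           Branch router
        address         192.0.2.2
        parents         core-router    ; the router Nagios must pass through to reach this one
        }
```

With a definition like this, if core-router goes down, branch-router should be reported as unreachable rather than down.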
Thanks, now I’ll try with hostgroup.
I tried using check-host-alive, but it doesn’t seem to work: after several hours, Nagios still has not checked the router state. Which directive do I have to use to schedule host-alive checks?
@jakkeup: I monitor all the routers of my network, so it cannot be that an intermediate router interrupts my monitoring.
Host checks are not scheduled. A host is only checked if a service check for that host fails. To start with, define a service check in services.cfg that checks that host with the check_ping plugin.
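A minimal sketch of that setup for a single router might look like this. The host name, address, templates, intervals, and thresholds are all assumptions for illustration; check-host-alive and check_ping are assumed to be defined as commands, as in the sample configuration.

```
# hosts.cfg: the host check runs only on demand, e.g. when a service on the host fails
define host{
        use                     generic-host      ; assumed template
        host_name               router1           ; hypothetical router
        alias                   Router 1
        address                 192.0.2.10
        check_command           check-host-alive
        max_check_attempts      3
        }

# services.cfg: this service is what actually gets scheduled
define service{
        use                     generic-service   ; assumed template
        host_name               router1
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%   ; example thresholds only
        normal_check_interval   5                 ; in time units, normally minutes
        retry_check_interval    1
        }
```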
OK, so I have to create a service to check whether a host is alive, or take the hostgroup route… creating (and ignoring) a PING service for all routers (or grouping them per customer, for example).
I work for a big outsourcing company (no names :D) and they already use Openview, Vitalnet, Webtrends and other software. So initially I only want to check router reachability. After that, we will see…