I have several remote sites that sit behind a load-balancing router. I would like to monitor both incoming connections but still have Nagios see each site as one host. Is this possible, or is there some sort of workaround?
I suppose you could “clone” the service check command definitions and hard-code the secondary connection’s IP address instead of using the $HOSTADDRESS$ macro, then use the clones for the service checks on the secondary connection, if you see what I mean. Your command definitions would look something like this:
define command{
    command_name    check_http
    command_line    $USER1$/check_http -I $HOSTADDRESS$ -u $ARG1$
}

define command{
    command_name    check_http_SiteA_Con2
    command_line    $USER1$/check_http -I 10.10.10.1 -u $ARG1$
}

define command{
    command_name    check_http_SiteB_Con2
    command_line    $USER1$/check_http -I 172.16.0.1 -u $ARG1$
}
Then for your Site_A host you’d define service checks using both check_http and check_http_SiteA_Con2, for Site_B check_http and check_http_SiteB_Con2, and so on. Yeah, it’s a bit of a kludge, but it might work…
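If you do go the clone route, a few lines of script can at least generate the per-site definitions instead of hand-copying them. This is a minimal Python sketch, assuming a hypothetical `SECONDARY_IPS` map of site name to secondary-connection address (the two addresses are the ones from the example above); you would paste or include its output in your object config.

```python
# Hypothetical map: site name -> IP of that site's secondary connection.
SECONDARY_IPS = {
    "SiteA": "10.10.10.1",
    "SiteB": "172.16.0.1",
}

# Braces are doubled because str.format() would otherwise treat them as fields.
TEMPLATE = """define command{{
    command_name    check_http_{site}_Con2
    command_line    $USER1$/check_http -I {ip} -u $ARG1$
}}"""


def command_definitions(sites):
    """Render one cloned check_http command definition per site, sorted by name."""
    return "\n\n".join(TEMPLATE.format(site=site, ip=ip)
                       for site, ip in sorted(sites.items()))


print(command_definitions(SECONDARY_IPS))
```

Adding a site then means adding one dictionary entry, although you still need the matching service definitions.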
Strides’ solution is a good idea.
Still, in case it helps, here’s another idea (it’s what we use):
Create a small script that calls check_http but takes two IP addresses/hostnames instead of just one. In Nagios, the host will be a “virtual host” (active_checks_enabled set to 0 and no “address” directive).
The algorithm would be something like:
get all the arguments, especially host1 and host2
do a check_http (or check_ping or whatever) on host1
do the same on host2
use the results above to determine the exit code:
if 0 errors => exit 0 (OK)
if 1 error => exit 1 (WARNING)
if 2 errors => exit 2 (CRITICAL)
(otherwise: exit 3 => UNKNOWN)
That’s what we usually do: if one route is down, it doesn’t mean the service is down, because it is still reachable over the other route, so WARNING is a good way to say it.
Also, if the host is down, you get only one alarm instead of two.
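The algorithm above could be sketched in Python roughly like this. It is a minimal illustration, not Loose’s actual script: the name `check_dual`, the argument order (plugin path, two addresses, then any extra plugin arguments) and passing each address with `-H` are assumptions, so adjust them to whichever plugin you wrap.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def run_check(plugin, address, extra_args):
    """Run one plugin against one address; return (exit_code, first output line)."""
    try:
        proc = subprocess.run([plugin, "-H", address, *extra_args],
                              capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Plugin missing, not executable, or hung: we can't tell, so UNKNOWN.
        return UNKNOWN, f"{address}: {exc}"
    lines = proc.stdout.splitlines()
    # Drop the sub-check's perfdata ("|..."), otherwise it would swallow our output.
    text = lines[0].split("|", 1)[0].strip() if lines else f"{address}: (no output)"
    return proc.returncode, text


def combine(codes):
    """0 failed checks -> OK, 1 -> WARNING, 2 -> CRITICAL, anything else -> UNKNOWN."""
    failures = sum(1 for code in codes if code != OK)
    return {0: OK, 1: WARNING, 2: CRITICAL}.get(failures, UNKNOWN)


def main(argv):
    """argv: [script, plugin, host1, host2, extra plugin args...]; returns the exit code."""
    if len(argv) < 4:
        print("UNKNOWN - usage: check_dual PLUGIN HOST1 HOST2 [PLUGIN ARGS...]")
        return UNKNOWN
    plugin, host1, host2, *extra = argv[1:]
    results = [run_check(plugin, host, extra) for host in (host1, host2)]
    state = combine([code for code, _ in results])
    print(f"{LABELS[state]} - " + "; ".join(text for _, text in results))
    return state


def entry_point():
    """Exit with the combined state so Nagios sees the right return code."""
    sys.exit(main(sys.argv))
```

To use it as a plugin, call `entry_point()` under `if __name__ == "__main__":` at the bottom of the file, then reference it from a command definition with something like `$USER1$/check_dual $USER1$/check_ping <addr1> <addr2> -w 200.0,20% -c 600.0,100%` (a hypothetical invocation).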
Nice, Loose! An elegant solution that avoids the limitations and repetition involved in my “kludge”. OP, if you can script it, I’d have a go at doing this if I were you. If you can’t, by all means kludge it for starters, but learn how to write custom scripts as soon as you can and have a crack at getting Loose’s solution working.
Once you can write such things, it is easy to bend Nagios to your will.
Awesome, thanks guys.
Scripting it shouldn’t be a problem, but how do I go about it? Could I just throw it together in Perl? Once the script is done, do I just place it with the other plugins and write a command definition?
Yep, Perl is as good as anything, IMHO…
Yep (remember to set the ownership and permissions to match your other checks… usually nagios:nagios and 755).
Okay, as an extension to this: behind this router, I have another network that I need to monitor. I have another Linux box back there that I will run Nagios on as a distributed node.
How can I extend this so that Nagios knows which IP to use based on the result of the check on the router?
Um, I’m not entirely sure I follow. Introducing a second Nagios install as a distributed server makes “so that Nagios knows which IP to use” ambiguous: which Nagios, and which IP to use to do what?
If you mean the central Nagios talking to the distributed one, my understanding is that it works the other way around: the distributed Nagios submits results to the central one as passive checks, and the central Nagios’s IP address stays the same regardless of the load-balancing state, so your router should sort out the routing to the central server itself. That said, my environment isn’t big enough to need distributed servers and I’ve never used them in anger, so I could well be talking out of my hat.
I reckon, though, that you more than likely mean something completely different. If you could explain in a little more detail, I’ll see if I can keep up.
Never mind, I realized last night that I was thinking about it backwards.
Okay, I have it working (sort of).
I am hitting one major issue and one minor one.
The major issue:
The service status is listed as OK, but the host continues to be listed as “offline”. I’m assuming it’s a configuration issue with the host?
The minor issue:
I wrote the plugin to return the results of check_ping so I can see them all in the Status Information box. I would like the result for each interface to be on its own line, but can’t figure out how to do it. Multiline output is supposed to work in version 3.
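For the minor issue: as I understand the Nagios 3 plugin API, the first line a plugin prints is the short status (optionally followed by `|` and performance data), and any further lines are stored as “long output”, which the CGIs show on the host/service detail page rather than in the status overview. A `|` inside those extra lines is read as the start of more performance data, so it is worth stripping check_ping’s own perfdata from each per-interface line. A minimal sketch with hypothetical helper names and made-up sample results:

```python
def strip_perfdata(line):
    """Drop anything after the first '|' so a sub-check's perfdata can't leak in."""
    return line.split("|", 1)[0].strip()


def format_output(state_label, summary, details):
    """First line: short status; each following line: one interface (long output)."""
    lines = [f"{state_label} - {summary}"]
    lines.extend(strip_perfdata(detail) for detail in details)
    return "\n".join(lines)


# Example with made-up check_ping results for two interfaces.
print(format_output("OK", "2 of 2 WAN links up", [
    "WAN1: PING OK - Packet loss = 0%, RTA = 21.40 ms|rta=21.4ms;200.0;600.0;0",
    "WAN2: PING OK - Packet loss = 0%, RTA = 35.10 ms|rta=35.1ms;200.0;600.0;0",
]))
```

If the overview page still shows only the first line, that is expected; put the most useful summary there.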
Edit:
Fixed number one. I ditched the service definition and put the check in the host block. Nagios whines at me on the config check that I don’t have a service associated with the host, but (I think, anyway) that is pretty trivial.
Thanks for the help so far, guys. Between you and the docs, I’m starting to get a handle on how this all fits together.
I’ve been thinking over my logic and I want to run it past you.
I have a host entry for the router. It contains a check_command that points to the script I wrote (which takes multiple IPs as arguments).
Then, associated with the host, I have a service defined for each WAN IP using check_ping.
My thinking is to set the check interval on the host to 0, so that only the services are regularly monitored. If one of the services (WANs) goes into a hard state, the script that checks all interfaces runs and updates the status of the host.
Does this sound logical? Can I set check_interval on the host to 0 to stop it checking by itself, or should I just set it to some arbitrarily large value?
I’d say: don’t worry about the check interval for the host;
in fact, Nagios doesn’t really do host checks on a regular basis (unless you set use_aggressive_host_checking to 1 in nagios.cfg).
=> Nagios “only” does service checks; whenever a service goes into a problem state, it tests the host… in the end, that’s what you are looking for
(that’s why Nagios wants a service for every host, btw)
Sorry to keep dragging this out, but I have more questions.
I haven’t deployed the distributed monitoring yet, and even then, I have some sites where that approach isn’t practical.
What I have now are two types of gateway router: a load-balancing router (as discussed before) and a failover router, whose check is supplied the primary and failover IPs. The check tests the primary and returns that result unless packet loss is 100%; in that case it tests the secondary and returns that result, escalating an OK to a WARNING. This check is the host check script for the device, and I have defined a service for each WAN interface (so it checks the WANs individually as service checks, then checks them both with the host check if an issue arises).
Anyway… I have some questions about this setup.
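The failover logic described above could look something like this in Python. It is a sketch under assumptions: `ping` is a hypothetical callable returning `(exit_code, output_text)` for one address (in practice, a wrapper that runs check_ping), loss is parsed from check_ping’s usual `Packet loss = N%` wording, and a failed check whose output can’t be parsed is treated as 100% loss.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LOSS_RE = re.compile(r"Packet loss = (\d+)%")


def parse_loss(text):
    """Return the packet-loss percentage from check_ping-style output, or None."""
    match = LOSS_RE.search(text)
    return int(match.group(1)) if match else None


def failover_state(primary, secondary, ping):
    """Check the primary; only if it is completely dead, fall back to the secondary.

    ping(address) -> (exit_code, output_text) is a hypothetical callable.
    """
    code, text = ping(primary)
    loss = parse_loss(text)
    # Assumption: unparseable output from a failed check counts as total loss.
    primary_dead = loss == 100 or (loss is None and code != OK)
    if not primary_dead:
        return code, f"primary {primary}: {text}"
    code, text = ping(secondary)
    if code == OK:
        code = WARNING  # running on the backup link: escalate OK to WARNING
    return code, f"primary {primary} down; secondary {secondary}: {text}"
```

Keeping `ping` as a parameter means the decision logic can be tested without a network; the real plugin would pass in a function that runs check_ping and returns its exit code and first output line.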
-
With both setups, I have a device behind the router that I monitor via SNMP. This device is accessed by forwarding a port on the gateway IP. So if that particular interface goes down, the device is reported DOWN (rather than UNREACHABLE) because its parent is still up. I assume I can fix this with dependencies… If I make the services on the device dependent on the associated “WAN” service on the router, will that be enough? Does this even make sense?
-
Is it possible to reference host macros inside the same host definition or within a service definition? I would like to define an on-demand macro for each WAN interface inside the host definition, then reference it in the check_command string for both the host and the services. It would be easier (although not very flexible) to just create a custom command that looks for _wan1 and _wan2, but that doesn’t help with the service checks. Here’s roughly what I would like it to look like:
define host{
    use             LoadBalancingRouter
    host_name       Public Gateway
    alias           Public Gateway
    parents         T-1, DSL
    hostgroups      Zywall 35 UTM, Firewalls
    _wan1           ***.***.***.***
    _wan2           ***.***.***.***
    check_command   check_multi!-w 200.0,20% -c 600.0,100% -t 2 -p 1 $_wan1$ $_wan2$
    check_interval  0
}

define service{
    host_name             Public Gateway
    use                   critical-service
    service_description   WAN 1
    check_command         check_ping_dist!$_wan1$!200.0,20%!600.0,100%
}

define service{
    host_name             Public Gateway
    use                   critical-service
    service_description   WAN 2
    check_command         check_ping_dist!$_wan2$!200.0,20%!600.0,100%
}
But I tried it, and no dice.
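For what it’s worth, in Nagios 3 a custom variable such as `_wan1` in a host definition is referenced as `$_HOSTWAN1$` (prefix `_HOST`, variable name uppercased), and, as far as I know, services on that host can use the same `$_HOSTWAN1$` form. `$_wan1$` is not a macro name Nagios recognizes, which would explain the “no dice”. The toy resolver below only illustrates that naming rule; it is not how Nagios expands macros internally, and the test addresses are made-up documentation IPs. I believe Nagios 3 also expands macros inside the `!`-separated check_command arguments, but if it doesn’t on your version, put `$_HOSTWAN1$` directly in the command_line instead.

```python
import re

MACRO = re.compile(r"\$(\w+)\$")


def expand(command, host_vars):
    """Replace $_HOST<NAME>$ tokens using a host's custom variables (toy model).

    host_vars maps custom variable names as written in the host definition
    (e.g. "_wan1") to their values.
    """
    # Nagios naming rule: "_wan1" in a host becomes the macro $_HOSTWAN1$.
    lookup = {"_HOST" + name.lstrip("_").upper(): value
              for name, value in host_vars.items()}

    def replace(match):
        # Toy behaviour: tokens we don't know (e.g. $_wan1$, $USER1$) are left as-is.
        return lookup.get(match.group(1), match.group(0))

    return MACRO.sub(replace, command)
```

So the service lines above would read `check_ping_dist!$_HOSTWAN1$!200.0,20%!600.0,100%`, and the host check would use `$_HOSTWAN1$ $_HOSTWAN2$`.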
Quoting vortmax: “… I assume I can fix this with dependencies… If I make the services on the device dependent on the associated ‘WAN’ service on the router, will that be enough? Does this even make sense?”
Dear Vortmax,
I’m afraid that Service Dependencies won’t work as expected. Take a look at:
viewtopic.php?f=59&t=5104&p=16828&hilit=dependencies#p16828
Hope it helps.