Rings and Parents

slebrun · July 24, 2009, 3:30pm

Imagine a ring:

    F
   / \
  d   e
  |   |
  b   c
   \ /
    A

Each of these is a host group.
Host group A contains a router, a switch to point b, and a switch to point c.
Host groups b, c, d, e each contain a switch from the ‘downstream’ point, a router, and a switch to the ‘upstream’ point.
Host group F contains a switch to point d, a switch to point e, and a router.
These all have parents assigned. In other words, to get to the router for host group d, you follow this path:
router A
switch a-to-b
switch b-from-a
router b
switch b-to-d
switch d-from-b
router d
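
To make the parent chain concrete, here is a small Python sketch of the path above as a child-to-parent map. The host names are just the labels from the list; the function name `path_to` is made up for illustration and is not anything Nagios provides.

```python
# Child -> parent, taken from the path listed above.
# Only the A-b-d side of the ring is modelled here.
PARENTS = {
    "switch a-to-b": "router A",
    "switch b-from-a": "switch a-to-b",
    "router b": "switch b-from-a",
    "switch b-to-d": "router b",
    "switch d-from-b": "switch b-to-d",
    "router d": "switch d-from-b",
}


def path_to(host, parents=PARENTS):
    """Return the chain of hosts from the root down to `host`."""
    chain = [host]
    # Walk upward until we hit a host with no parent (the root).
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return list(reversed(chain))


if __name__ == "__main__":
    print(" -> ".join(path_to("router d")))
```

Each host has exactly one parent here, which is why there is only one path to router d as far as the parent definitions are concerned.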

Now, if, say, the link between A and b is interrupted, actual traffic will flow around the ring and there’s no real issue.
If, however, Nagios detects that the switch at host group b is not responding, it will mark everything else in host group b, and everything in host group d, as ‘unreachable,’ when those devices are, in fact, still reachable the other way around the ring.
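
As a rough illustration of the behaviour described above (a toy model of what I am seeing, not Nagios's actual code), this Python sketch lists every host that sits behind a failed device in a single-parent tree. The map covers only the A-b-d path, and `presumed_unreachable` is a hypothetical name.

```python
# Child -> parent for the A-b-d side of the ring only.
PARENTS = {
    "switch a-to-b": "router A",
    "switch b-from-a": "switch a-to-b",
    "router b": "switch b-from-a",
    "switch b-to-d": "router b",
    "switch d-from-b": "switch b-to-d",
    "router d": "switch d-from-b",
}


def presumed_unreachable(failed, parents=PARENTS):
    """Return the hosts whose parent chain passes through `failed`."""
    behind = []
    for host in parents:
        node = host
        # Climb toward the root; stop as soon as we meet the failed host.
        while node in parents:
            node = parents[node]
            if node == failed:
                behind.append(host)
                break
    return behind


if __name__ == "__main__":
    for host in presumed_unreachable("switch b-from-a"):
        print(host, "is treated as unreachable")
```

The model knows nothing about the second way around the ring through c, e, and F, which is exactly the problem.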

Now, the easiest way I can see to fix this would be an option to force host checks on devices Nagios considers to be unreachable. If a device cannot, in fact, be successfully host-checked, mark it as ‘down’ or ‘unreachable’ as the host-availability logic would normally go. If it is successfully host-checked, mark it as ‘up’ and move on with life.
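
In Python terms, the rule I have in mind looks roughly like the sketch below. `classify` and `probe` are hypothetical names: `probe` stands in for whatever host check command is configured, and the down-versus-unreachable split assumes the usual rule that a failed host is ‘down’ when its parent is up and ‘unreachable’ otherwise.

```python
def classify(host, parent_state, probe):
    """Proposed rule: always run the host check, whatever the parent state.

    `probe` is a callable that returns True when the host answers.
    """
    if probe(host):
        # The host answered, so it is up even if its parent is not.
        return "UP"
    # The check failed: fall back to the normal parent-based decision.
    return "DOWN" if parent_state == "UP" else "UNREACHABLE"


if __name__ == "__main__":
    # Pretend the ring still delivers traffic to router d via c, e and F.
    answers = {"router d": True, "router b": False}
    for name, parent_state in [("router d", "DOWN"), ("router b", "DOWN")]:
        print(name, classify(name, parent_state, lambda h: answers[h]))
```

The point is only that the actual check result wins over the parent's state; everything else stays as it is today.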

The harder way, of course, would be to get Nagios to recognize the concept of ring topology…

Any other ideas? Will enabling use_aggressive_host_checking do what I’ve suggested above? The docs lead me to think that all that option does is cause warnings from host checks to mark the host as ‘down.’

I’m open to suggestions!