Organizing a legion of hosts

I’ve successfully configured Nagios 2.0b2 and utilized the nmap2nagios.pl “extra” to import several subnets worth of hosts from my data center. Unfortunately, the status map layout is . . . less than beautiful . . . and I’ve only completed the first few subnets!

What I’m looking for is some advice: how did you organize your parent / child relationships? Any advice on using the check_cluster2 contrib? Is my only hope to use user supplied coordinates?

I’ve thought about adding ‘dummy’ parents on secondary IPs of reliable hosts in given ‘service clusters’ just to allow the status map to sort itself out better. Similar thoughts with using the ‘cluster IP’ for host clusters as a parent to the host cluster nodes. I’d love to hear your thoughts on problems or benefits that these approaches might create.

On Nagios 1 I worked out the whole user-defined coords (100 and some hosts). On Nagios 2 I’m not going to do it... :) I never looked at the map, so I think it’s not worth the time it takes.
Jakkedup spoke of nagiosmap once. I don’t know how it works and haven’t tried it out yet.

Luca

Have you tried using Nagiosmap? It’s a little extra that allows you to shuffle the icons about, and it’ll modify the hostextinfo coordinates for you.

The problem with just dumping a bunch of hosts into Nagios is just what you’ve got now. You have no parent/child relationships, so the map looks screwy and doesn’t really show you how your network is cabled together. The nice thing about the $10,000 commercial programs is that they actually map out your network, cable by cable. Nagios is not that costly, so you will have to do it like I did, i.e. by hand.

For example:
Nagios pc is using eth0 that plugs into Switch1 port1.
Switch1 port 24 connects to Switch2 port 1
Switch2 port 24 connects to router1 port 2.
Router1 has 3 other ports, 1, 3, 4
So, the hosts file would kinda be like this:
Nagiospc: no parents
Nagiospc-eth0: parent is Nagiospc
Switch1 port 1: parent is Nagiospc-eth0
Switch1: parent is Switch1 port 1
Switch1 port 24: parent is Switch1
Switch2 port 1: parent is Switch1 port 24
Switch2: parent is Switch2 port 1
Switch2 port 24: parent is Switch2
and so on.
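
For anyone who would rather generate the host definitions than type them, here is a minimal Python sketch of the list above. The host names are just the ones from the example with spaces turned into dashes (an assumption on my part), and the output is only a fragment: a real Nagios host definition also needs things like an alias, an address and check settings, which are left out here.

```python
# Hypothetical cabling map taken from the example: (host_name, parent or None).
LINKS = [
    ("nagiospc", None),
    ("nagiospc-eth0", "nagiospc"),
    ("switch1-port1", "nagiospc-eth0"),
    ("switch1", "switch1-port1"),
    ("switch1-port24", "switch1"),
    ("switch2-port1", "switch1-port24"),
    ("switch2", "switch2-port1"),
    ("switch2-port24", "switch2"),
]


def host_definition(name, parent=None):
    """Return a partial 'define host' block; 'parents' is omitted for the root."""
    lines = ["define host{", f"    host_name   {name}"]
    if parent is not None:
        lines.append(f"    parents     {parent}")
    lines.append("    }")
    return "\n".join(lines)


def render(links):
    """Join one partial host block per (name, parent) pair."""
    return "\n\n".join(host_definition(name, parent) for name, parent in links)


if __name__ == "__main__":
    print(render(LINKS))
```

The printed blocks would still need to be merged into your own host templates by hand.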

As you can see, I’ve looked at our network, by hand, to see which ports connect the switches, routers, etc. together. This will resolve your problem with the map, since now Nagios will know that nagiospc-eth0 is connected to switch1 port 1, and switch1 port 1 is (of course) connected to switch1, and switch1 port 24 is connected to... See how this goes? The great thing about doing this is that you can now have some yoyo unplug every single cable on your network, and YOU WILL BE ABLE to recreate the entire wiring just by looking at your map.

The other added benefit is that now, when ONE cable unplugs, you won’t have an entire page of Criticals. You will only have ONE critical, and the rest of the hosts will have status “UNREACHABLE”, since now Nagios knows that “hey, these devices are children of switch1 port 1, and since switch1 port 1 is dead, the rest must be unreachable.”
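
To make the DOWN versus UNREACHABLE idea concrete, here is a rough Python sketch. It is a simplified model and not the actual Nagios code: a host that does not answer is called DOWN if it has no parents or at least one parent is UP, and UNREACHABLE otherwise. The host names and the `classify` function are made up for illustration, and the parent map is assumed to have no loops.

```python
# Hypothetical parent map based on the cabling example in this thread.
PARENTS = {
    "nagiospc": [],
    "nagiospc-eth0": ["nagiospc"],
    "switch1-port1": ["nagiospc-eth0"],
    "switch1": ["switch1-port1"],
    "switch1-port24": ["switch1"],
    "switch2-port1": ["switch1-port24"],
}


def classify(parents, responding):
    """Return {host: 'UP' | 'DOWN' | 'UNREACHABLE'} for every host in parents."""
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if host in responding:
            states[host] = "UP"
        else:
            ups = parents.get(host, [])
            # No parents, or some parent still UP: this host is the real problem.
            if not ups or any(state_of(p) == "UP" for p in ups):
                states[host] = "DOWN"
            else:
                states[host] = "UNREACHABLE"
        return states[host]

    for host in parents:
        state_of(host)
    return states


if __name__ == "__main__":
    # The cable into switch1 port 1 is unplugged: only the Nagios box answers.
    for host, state in classify(PARENTS, {"nagiospc", "nagiospc-eth0"}).items():
        print(host, state)
```

In this model only switch1-port1 comes out as DOWN; everything behind it is UNREACHABLE, which is the single critical described above.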

Oh yeah baby, this is the real power of Nagios, when your boss and you can see exactly how your network is wired. But it takes a week or more just to map it out on paper.

Have fun, I know I did, and now we have the most beautiful network map you ever saw.

And yea, use nagiosmap to move the stuff around by hand, AFTER you have done the above, if you don’t like the way nagios draws it.
[Edited Sat Mar 12 2005, 03:43PM]

Back to my favorite topic, the map. I realize that to many, doing the above seems a bit too much. I don’t know how critical repair time is on your network, but we can’t produce a product without ours. When I first started working in this network group, we would get calls like “I can’t get to the internet”. So after 45 minutes of pinging this device and that device, we would finally find the problem and fix it.

Setting up Nagios in a basic fashion as you guys have done was a great first step for us too, but when problems occurred, all we got was dozens of hosts being down. So we had the same problem as before, i.e. which device is causing the “Network Outage”? Surely not all devices are dead.

So, setting up parent/child relationships is pretty much mandatory for our network. Taking it even further, by including SNMP checks of the status of the ports that cable our network together, will “walk” you directly to the problem. In my few years of Nagios experience, there have been many, many times when all that was needed was to reseat (unplug/plug in) a network cable, and it would be fixed. Without Nagios parents/children, it would take a very long time to find that bad cable.

Another thing about our network is that it is redundant. From Switch-A to Switch-B, we have 2 cables that connect them. This causes a network loop. The loop is broken by the switches using spanning tree, which blocks one path. Adding an SNMP check for the ports will show you if a port changes from its “normal” forwarding or “normal” blocking state, thus giving you a clue that something has changed that should not have. Now even that check turned out to be invaluable. Many of our networks are extremely complex, with VPNs, routers, etc., and it was discovered that something on our network was causing spanning tree to “rethink” which switch/router is going to be the “root bridge”, thus causing a network burp. So after finding out that VLANs don’t keep spanning tree separate, we had to remove some VLANs from a switch and put them on their own switch.
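
The forwarding/blocking check described above could be sketched roughly like this in Python. It assumes the port state has already been fetched as the integer value of dot1dStpPortState from the standard BRIDGE-MIB (with snmpget or similar). The function name and the expected-state argument are made up for illustration, and the return codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
# Integer values of dot1dStpPortState as defined in the BRIDGE-MIB.
STP_STATES = {
    1: "disabled",
    2: "blocking",
    3: "listening",
    4: "learning",
    5: "forwarding",
    6: "broken",
}

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_stp_state(observed, expected):
    """Compare a port's observed STP state with the state it normally has.

    observed: integer read from dot1dStpPortState for the port
    expected: the port's "normal" state name, e.g. "forwarding" or "blocking"
    """
    name = STP_STATES.get(observed)
    if name is None:
        return UNKNOWN, f"UNKNOWN - unexpected STP state value {observed}"
    if name == expected:
        return OK, f"OK - port is {name}"
    return CRITICAL, f"CRITICAL - port is {name}, expected {expected}"


if __name__ == "__main__":
    # A redundant uplink that normally blocks has started forwarding.
    code, message = check_stp_state(5, "blocking")
    print(code, message)
```

The point of passing the expected state per port is that a blocking port is perfectly healthy on the redundant link, so the check only complains when a port leaves whatever state is normal for it.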

I’m telling you this because much can be discovered about your network, and it can all be mapped out by Nagios, if you just take the time to console into each and every switch/router and see just how it’s configured. Nagios can then be configured to show you every VLAN, every switch-to-switch cable connection, and every router path.

When you get just 50% done with the above, your boss is going to give you much praise. But when you get to 90%, perhaps it’s time to ask for the raise, since he now knows you have the entire network all mapped out and being checked every 5 minutes, not just a bunch of hosts all thrown together.

Well, I’m on a roll here and I can’t stop now, sorry.

The switches we have are snmp enabled, and after looking at all the info available by telnet, or a “device manager” gui, I decided I wanted to know the status of the fans/power supplies/heat/etc of each of these devices.

Quickly I found the need for a “MIB Browser”. You can get a free one from kill-9.org/mbrowse/ that works just fine.

Not long after setting up another 50 or so checks of these items, Nagios discovered an upper power supply problem in a router that has 2 power supplies. Now think of the mess we would have had if months had gone by and then the other power supply had died. Oooops. But Nagios saved the day, and we quickly repaired the busted one.

Another tip is to try and tune your service checks for maximum speed. Use check_fping instead of check_ping.
Use check_snmp instead of check_ifoperstatus.
Include the -m switch with check_snmp and specify the MIB to use, so it doesn’t have to load every MIB. Go ahead and compare the two, and you will see that it’s a hair quicker.
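
As a rough illustration of the -m tip, here is a small Python helper that builds the check_snmp argument list with and without a MIB. The -H, -C, -o and -m options are real check_snmp switches; the host name, community string and port index are made-up placeholders, and the helper itself is only a sketch.

```python
def check_snmp_args(host, community, oid, mib=None):
    """Build a check_snmp command line as a list of arguments.

    mib: optional MIB name passed via -m so only that MIB is loaded.
    """
    args = ["check_snmp", "-H", host, "-C", community, "-o", oid]
    if mib:
        args += ["-m", mib]
    return args


if __name__ == "__main__":
    # Hypothetical host, community and port: operational status of port 24.
    plain = check_snmp_args("switch1", "public", "IF-MIB::ifOperStatus.24")
    tuned = check_snmp_args("switch1", "public", "IF-MIB::ifOperStatus.24", mib="IF-MIB")
    print(" ".join(plain))
    print(" ".join(tuned))
```

Running both printed commands under `time` from the plugin directory is one way to do the comparison suggested above.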

Lastly, if you have around 298 hosts like we have, then nagiosmap is going to be a real “time killer”. When you first fire it up, there are no host coordinates, so nagiosmap just kinda lumps them all in a ball, which resembles fishing line that is all tangled up. I’d suggest working on the fishing line mess during times when you are bored stiff, but it can actually be kinda therapeutic, like knitting. Just drag the hosts around until you finally untangle the mess, and then you can begin to organize the hosts the way you like them. Actually, the way Nagios draws them when using the balanced tree, and others, is technically the way it should be drawn, since that view shows you more easily who is a child/parent and makes it easier to find the broken cable/host. This becomes important when you get a network problem that in turn makes most of your devices unreachable. Nagios starts to get bogged down, starts reporting all kinds of errors, and it’s kinda hard to locate the exact problem without a good status map.

I personally can’t live without a good status map during times of “Oh man, the whole network looks like it’s down”. I just calmly go to the map and say “there is your problem, we’ll have it fixed in 5 minutes, so take a pill”.

Thank you, jakkedup! From your comments I infer that the time I’m taking now will be well spent; you seem to have gained a lot from careful planning.

I’ve started drawing out the child-parent relationships, and will check out the MIB browser you recommended. Aside from our network infrastructure gear, the majority of hosts we’ll be monitoring are Dell servers. Like them or leave them, here’s a link to an article regarding ‘omserver’ integration with Nagios for monitoring Dell HW: dell.com/downloads/global/power/1q04-sch.pdf.

NagiosMap definitely makes the layout easier. Thanks for that pointer too, folks.

I’ll keep checking for any other organizational tips, and I’ll post back if I ‘discover’ any especially useful time-saving organizational tricks as I move forward.

Correction:
… ‘omreport’ integration with Nagios for monitoring Dell HW: dell.com/downloads/global/power/1q04-sch.pdf
[Edited Sun Mar 13 2005, 07:14PM]