Monitoring physical switches that form one logical structure

Hi!

I’m trying to monitor a huge LAN. For most of the machines I just use check_host, but there are some Nortel switches where I can’t go that way. The problem is that several switches are linked through their special port, which makes the connection redundant: if one fails, the others take the load. This configuration works pretty well; in fact, the problem is that if a switch fails, no one notices!

So they all appear as one logical element, with just one IP. I want to monitor single-element failures, so that if one switch goes down I get an alarm in Nagios. I’m trying with SNMP, but I can’t find the proper OID.

Can anyone in a similar situation help me?

Thanks in advance.

On the Nortel website you should find some lists… but it can be quite painful anyway… Lacking a list, I had to scan the OIDs of a Sun 10000 almost entirely by trial… :confused:

If you go by trial, you could attach a machine with a known IP to one port and look for that IP address by running

snmpwalk | grep ‘$IPADDRESS’
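One catch with a plain grep is that an address like 10.0.0.4 also matches 10.0.0.42. As a rough sketch, assuming you have saved the snmpwalk output to a string, something like this keeps only exact matches. The function name and the sample lines are made up for illustration; they are not taken from a real switch.

```python
import re


def lines_with_address(walk_output, ip_address):
    """Return the snmpwalk output lines that mention ip_address exactly."""
    # Reject a hit when the address is embedded in a longer number,
    # e.g. 10.0.0.4 inside 10.0.0.42 or inside 110.0.0.4.
    pattern = re.compile(r"(?<![0-9])" + re.escape(ip_address) + r"(?![0-9])")
    return [line for line in walk_output.splitlines() if pattern.search(line)]


# Hypothetical sample output, for illustration only.
SAMPLE_WALK = """\
.1.3.6.1.2.1.4.22.1.3.5.10.0.0.42 = IpAddress: 10.0.0.42
.1.3.6.1.2.1.4.22.1.3.7.10.0.0.4 = IpAddress: 10.0.0.4
.1.3.6.1.2.1.2.2.1.8.5 = INTEGER: 1
"""

for line in lines_with_address(SAMPLE_WALK, "10.0.0.4"):
    print(line)
```

The OID on the matching line is then the place to start looking around in the MIB.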

Hope it helps… Luca

PS: Have a look in the forum; I think Jakkedup posted a link to a tool for SNMP searches some time ago.
[Edited Wed May 18 2005, 06:33AM]

Switches are another of my favorite topics, so I’m going to elaborate a bit. This is how I deal with your problem of “redundant paths to the same switch”, which is exactly how our network is set up too.

First get yourself a couple of tools, e.g. Mbrowse, an SNMP MIB browser at kill-9.org/mbrowse/, and Java Device Manager (Java Device Manager v5.7.9.0 for BayStack 5510 10/100/1000 Switch v4.0.0 and v4.0.1), a Nortel switch configuration tool available from the Nortel website. You will use mbrowse to look at a switch/router/any SNMP-enabled device and view the vast amount of information that is available to you. You can also use Device Manager to view the status of your switches. Anything that you can see in Device Manager can also be viewed using mbrowse.

The Nagios plugin that you will be using is check_snmp ONLY.

My Nagios PC is connected to a switch (switch A) via its eth0 port. I want to show this port on my status map, along with its status. The status of this port has an OID of “.1.3.6.1.2.1.2.2.1.8” and the expected value configured in Nagios is “1”, which means the port is up. Do not use check_ifoperstatus (it takes longer than check_snmp). So the Nagios check command looks like this:
check_snmp!public!.1.3.6.1.2.1.2.2.1.8!1!ifOperStatus!RFC1213-MIB
The check_snmp command definition looks like this:
command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ -r $ARG3$ -l $ARG4$ -m $ARG5$
If you don’t use the -m switch, then the command takes a huge amount of time, since it will then search through every single mib installed on your Nagios pc in /usr/share/snmp/mibs/, so make sure you have them installed.
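To make the link between the two lines above explicit: everything after the first “!” in the check command becomes $ARG1$, $ARG2$ and so on in the command definition. Here is a small sketch that mimics that substitution. The plugin directory and host address are placeholder assumptions, and this only imitates the macro expansion; it is not Nagios code.

```python
def expand_command(command_line, check_command, macros):
    """Fill $ARGn$ and other $MACRO$ placeholders into a command_line."""
    # The part before the first "!" is the command name; the rest are
    # positional arguments: $ARG1$, $ARG2$, ...
    name, *args = check_command.split("!")
    values = dict(macros)
    for position, value in enumerate(args, start=1):
        values[f"ARG{position}"] = value
    expanded = command_line
    for macro, value in values.items():
        expanded = expanded.replace(f"${macro}$", value)
    return name, expanded


COMMAND_LINE = ("$USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ "
                "-r $ARG3$ -l $ARG4$ -m $ARG5$")
CHECK_COMMAND = "check_snmp!public!.1.3.6.1.2.1.2.2.1.8!1!ifOperStatus!RFC1213-MIB"

# Assumed values: adjust the plugin path and address to your own setup.
MACROS = {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.0.2.10"}

name, expanded = expand_command(COMMAND_LINE, CHECK_COMMAND, MACROS)
print(expanded)
```

Running the printed command by hand from a shell is a quick way to see what Nagios will get back before you commit the check to services.cfg.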

Now that we know the status of the NIC, we want to know the status of the interface on the switch, so use the same command, but change the OID to the correct one for that port on the switch. When you configure hosts.cfg, make “eth0” the parent of the switch port, and make the switch port the parent of the switch itself (use fping to check the switch itself). YES, each one is an individual host, and here is why. That way, when you look at the status map, you will see a line from Nagios to the eth0 NIC, a line to the switch port and then a line to the switch. Anyone can now remove all the cables from your PC and the switch, and you will know exactly how to cable it back up, just from your Nagios setup (how cool is that?).
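A note on “the correct OID for that port”: as far as I know, .1.3.6.1.2.1.2.2.1.8 is the ifOperStatus column, and a single port is addressed by appending that port's ifIndex. A minimal sketch of building the per-port check command follows. The helper name is made up, and the ifIndex value is an assumption: it does not always equal the number printed on the front panel, so walk ifDescr on your own switch to confirm.

```python
IF_OPER_STATUS = ".1.3.6.1.2.1.2.2.1.8"  # ifOperStatus column, as quoted in the post


def port_check_command(if_index, community="public"):
    """Build the check_command string for one port's ifOperStatus."""
    if if_index < 1:
        raise ValueError("ifIndex values start at 1")
    # One port = column OID plus the ifIndex of that port.
    oid = f"{IF_OPER_STATUS}.{if_index}"
    # Same argument order as the check_snmp command definition in the post:
    # community, OID, expected value (1 = up), label, MIB.
    return f"check_snmp!{community}!{oid}!1!ifOperStatus!RFC1213-MIB"


# Assumed ifIndex 23 for illustration.
print(port_check_command(23))
```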

Now on to the redundant part of the network. The switch is connected to another switch, which is connected to the router. We want to show the status of these connections, and we want to show every port involved (as an individual host, for the same reason as above, cool huh?)
Switch A connects to Switch B via 2 fiber connections (or in your case, maybe 2 copper ports). We use spanning tree to block one of the fiber connections so that there is no network loop. You could also connect them using a multilink trunk. We want to show that A is connected to B with 2 cables, so create 5 more hosts.cfg entries, i.e. SwitchAport23, SwitchAport24, SwitchBport1, SwitchBport2 and SwitchB.
Configure services.cfg to check the “ifstatus” of each port, using the correct OID for each port. For the switch itself, simply use check_ping or check_fping for the service. “Obviously the switch must be up if the port is up, so why are you bothering to ping the switch every 5 minutes?” you ask. Well, I want to show every important connection and every switch on the status map. That way, someone can rip all the cables out of our entire network, and I can cable it back up identically to the way it was.

Status map will now look like this:
Nagios–eth0–SwitchAport1–SwitchA–SwitchAport23–SwitchBport1–SwitchB
                                –SwitchAport24–SwitchBport2

Parents for the hosts.cfg go like this:
eth0 no parent (so it defaults to the Nagios process)
SwitchAport1 parent is eth0
SwitchA parent is SwitchAport1
SwitchAport23 parent is SwitchA
SwitchAport24 parent is SwitchA
SwitchBport1 parent is SwitchAport23
SwitchBport2 parent is SwitchAport24
SwitchB parent is SwitchBport1,SwitchBport2
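If you have a lot of these, the parent list above is easy to turn into host definitions with a few lines of script. This is only a sketch: the “generic-host” template name is an assumption, and the definitions are deliberately incomplete (address, check_command and the rest are left for you to fill in or inherit from your template).

```python
# Parent relationships exactly as listed in the post.
PARENTS = {
    "eth0": [],
    "SwitchAport1": ["eth0"],
    "SwitchA": ["SwitchAport1"],
    "SwitchAport23": ["SwitchA"],
    "SwitchAport24": ["SwitchA"],
    "SwitchBport1": ["SwitchAport23"],
    "SwitchBport2": ["SwitchAport24"],
    "SwitchB": ["SwitchBport1", "SwitchBport2"],
}


def render_host(name, parents, template="generic-host"):
    """Render a skeleton host definition with its parents directive."""
    lines = [
        "define host{",
        f"    use        {template}",  # assumed template name
        f"    host_name  {name}",
    ]
    if parents:
        # Several parents go on one line, separated by commas.
        lines.append("    parents    " + ",".join(parents))
    lines.append("}")
    return "\n".join(lines)


for host_name, host_parents in PARENTS.items():
    print(render_host(host_name, host_parents))
    print()
```

Keeping the table in one place also makes it harder to mistype a host name in a parents line, which matters because the names have to match exactly.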

You should be able to see the power in this. How many people have a complete diagram of their network all laid out in Nagios? How many people could remove every cable from their network, and then recreate it from their Nagios “status map”? Not many, I’m sure.

Now that you have all of these ports and switches configured, you can move on to more details about your switches. If you are using spanning tree to block one path (as we are), you want to know if that path ever changes, so use check_snmp to query the port’s spanning tree (STP) state and see whether it’s blocking or forwarding. If the port is not in its “normal” state, then we get a warning from Nagios that the port has changed from blocking to forwarding, or vice versa.

You could add check_snmp checks for the status of the 3 fans in each switch and for the power supply status. On your routers you can use check_snmp for the fans, redundant power supplies, CPU usage, and STP changes (spanning tree topology changes). If you have VLANs on your switches, you want to show them on your status map, so that when you look at it, it makes some sense why hostA is connected to vlan-Business and hostB is connected to vlan-accounting. Use VlanROWStatus to get the status of a VLAN. On a switch, you can actually have a VLAN configured but not enabled, so this gives you a way to show that in the Nagios status map.

Put as much detail about your network connectivity into Nagios as you can, and I guarantee that your boss will be amazed that someone finally has a network diagram. Not only that, but it tells you when it’s broken. Nagios will now show you which cable the electricians cut, or which cable some tech unplugged by mistake, since the status map will show everything “OK” up to the break, and the rest will be “unreachable”. If you have set up your hosts.cfg file as I have described, the only device that will show “DOWN” is the first device in your network chain that fails its check_snmp check. If you have configured your notifications correctly, the only notification you will get is the “DOWN” one, since I don’t care about any “Unreachable” errors.
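Here is a rough model of why only one host ends up “DOWN”. It is a simplification of the Nagios reachability logic as I understand it, not the real thing: a host that fails its check is DOWN if it has no parents or at least one parent is still up, and UNREACHABLE otherwise. The failure scenario is made up for illustration.

```python
# Parent relationships from the hosts.cfg layout described earlier.
PARENTS = {
    "eth0": [],
    "SwitchAport1": ["eth0"],
    "SwitchA": ["SwitchAport1"],
    "SwitchAport23": ["SwitchA"],
    "SwitchAport24": ["SwitchA"],
    "SwitchBport1": ["SwitchAport23"],
    "SwitchBport2": ["SwitchAport24"],
    "SwitchB": ["SwitchBport1", "SwitchBport2"],
}


def classify(parents, failing):
    """Map each host to UP, DOWN or UNREACHABLE (simplified model)."""
    states = {}
    for host, host_parents in parents.items():
        if host not in failing:
            states[host] = "UP"
        elif not host_parents or any(p not in failing for p in host_parents):
            # The check failed, but a parent still answers (or there is
            # no parent): the break is right here.
            states[host] = "DOWN"
        else:
            # Every parent is failing too, so the real problem is upstream.
            states[host] = "UNREACHABLE"
    return states


# Hypothetical scenario: the cable between the Nagios PC and Switch A is
# pulled, so every check behind eth0 fails.
FAILING = set(PARENTS) - {"eth0"}

for host, state in classify(PARENTS, FAILING).items():
    print(f"{host:15} {state}")
```

In this scenario SwitchAport1 is the only DOWN host and everything behind it is UNREACHABLE, which is the picture you want on the status map.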

[Edited Thu May 19 2005, 07:45AM]

The reason I show the status of every port used to connect switches to switches and then to a host comes down to one thing: speed in troubleshooting.

I joined the “network” group of techs, and within 2 weeks I got a call: “I can’t get to the internet”. After 45 minutes of working with the tech who set up the network, we found the problem. We pinged this switch, that switch, this router, that router, and it was a nightmare. So I found this Nagios program, and my life has been complete ever since. I mean it, it’s amazing. It now shows every port that is in use to connect switches to other switches, to routers, to hosts… Someone could remove every cable from every switch and I could wire it back up, just from my Nagios “status map”.

It’s not good enough to use Nagios only to show that host A is running and its ftp service is “OK”. If that device shows as “down” but is in fact “up”, then why? Perhaps it’s because the cable on a switch is loose (which was my first experience). I only had to unplug the cable and plug it back in, and the switch port status light went back on. **We need to know if a cable/port is broken, not just if a host is running its ftp service.**

As your network grows, add each and every connection to your Nagios. Demand that all the techs and people involved realize that they can’t go around and unplug cables, move cables around, without your knowledge. Nagios is watching all of this stuff, and you want your network diagram (Nagios status map) to be accurate and up to date.

In case you haven’t figured it out, I love my Nagios, I love my status map, and so does my boss.

Wow!

That was impressive. You really love Nagios, don’t you? :slight_smile:

Well, thanks a lot for your answers. By the way, I found exactly what I was looking for just a few hours after posting this.

The problem (I’m not sure if it was clear enough) was that I have only one IP for 4 stacked switches, so I can’t use fping for every switch. I finally found an OID that covers that:

s5ChasComOperState (.1.3.6.1.4.1.45.1.6.3.3.1.1.10)

By checking this via SNMP I can know the state of every physical switch. It also has entries for fans, power supplies, etc.

Anyway, thanks a lot for your guidelines on monitoring a network. I’ll try to follow them as closely as I can.

See you!

Which one are you using?
The complete oid is ???

“24 10/100BaseTX plus 1 MDA slot and 1 Cascade Slot”
“24 10/100BaseTX plus 1 MDA slot and 1 Cascade Slot”
“1 port 1000Base-SX, 450-1SX MDA”
“24 10/100BaseTX plus 1 MDA slot and 1 Cascade Slot”
“BayStack450 Switch Family Unit Power Supply”
“BayStack450 Switch Family Redundant Power Supply Feed”
“BayStack450 Switch Family Unit Power Supply”
“BayStack450 Switch Family Redundant Power Supply Feed”
“BayStack450 Switch Family Unit Fan”
“BayStack450 Switch Family Unit Fan”
“BayStack450 Switch Family Unit Fan”
“BayStack450 Switch Family Redundant Power Supply Fan”
“BayStack450 Switch Family Unit Fan”
“BayStack450 Switch Family Unit Fan”
“BayStack450 Switch Family Unit Fan”
“BayStack450 Switch Family Redundant Power Supply Fan”
“24 10/100BaseTX plus 1 MDA slot and 1 Cascade Slot”
“24 10/100BaseTX plus 1 MDA slot and 1 Cascade Slot”

I’m not sure why you would use that OID. To me, a stack is just one big switch with one IP. What I do care about is that all the fans are running, the power supplies are good, and the ports in use are operational. I’m not sure exactly which OID you are looking at, so I dunno.

Hi again!

I’m not at the customer site right now, and I don’t remember exactly what the full OID is.
But I’m sure it’s in the s5Cha MIB. If you search a little, you’ll find s5ChasComOperState.

This OID indicates whether one PHYSICAL switch is up or not.

And as I said, in this MIB you’ll also find several useful entries for checking the hardware status of Nortel switches.

See you

Hi!

Jakkedup, I’ve got a specific question. You said:
“we use check_snmp and query the STP spanning tree status to see if it’s blocking or forwarding”

Which OID do you use to check that?

Thanks

check_snmp!private!.1.3.6.1.4.1.2272.1.13.5.1.4.16.1!5!StgPortState!RAPID-CITY

$USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ -r $ARG3$ -l $ARG4$ -m $ARG5$

[Edited Thu Jun 02 2005, 09:22AM]

Just wondering… if you’re running a check_snmp for every single switch port, what’s the load on your Nagios host like? What kind of machine are you running it on?

Not every switch port, sorry if I misled you. But every port on a switch that is important. For example, the fiber port; port 1, since serv1 is plugged into that port; port 23, since it is used to trunk to another switch, etc.

By making hosts.cfg entries for each “important port”, when you look at the nagios status map, it actually shows you that server1 is plugged into switch3/port13 and so on. In other words, my nagios status map actually looks like the network itself. Unplug every cable from every “important host” and switch, and I can recreate the network from the nagios status map.

Now, if I made one host entry for switch1 and had all the port checks as services for that host, then the status map would show server1 connects to switch1, but what good is that? I wanted it to show server1 connects to port1 on switch1. I hope you see what this has done.

here is the gkrellm.


I think I get you. You’re defining the switch ports as hosts, defining the check_command to use check_snmp and as a result they only get checked when there’s a problem downstream, correct? That’s quite clever. Do you have services defined on these “port hosts” as well or only have host entries?

You got most of it, but not quite.
I have a service check for EVERY host. For ports, I’m interested in the ifoperstatus (is the port up or down). 99.9% of the time, if the ifoperstatus of a switch port is reported as down, it’s due to the cable being cut/unplugged/etc. I’ve never seen a port simply go “bad” and stop working. I’m also interested in the STP port status (spanning tree port status). We use spanning tree on the switches to prevent network loops from occurring. If someone makes a network loop, then spanning tree will block that port and cut them off.

So for example:
In our network we have deliberately made redundant connections from switch to switch. Spanning tree blocks the redundant path, and the loop is broken. If an electrician or someone else cuts/unplugs the “live” fiber cable, then spanning tree opens up the other path, and everything is cool. But since everything is still working, we would never know that one path is broken until the last path is broken too. Therefore, we need to SNMP-query each path for its spanning tree status. If a port is normally blocking, then that is what Nagios looks for. If its normal status is forwarding, then that is what we configure Nagios to look for.
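The “compare against the normal state” idea can be sketched like this. The numbering is the standard BRIDGE-MIB dot1dStpPortState enumeration; that the Nortel object queried in this thread uses the same numbers is an assumption (the expected value of 5 in the check command posted above suggests it), so check your switch's MIB. The function name and messages are made up for illustration.

```python
# dot1dStpPortState values from the standard BRIDGE-MIB.
STP_STATES = {
    1: "disabled",
    2: "blocking",
    3: "listening",
    4: "learning",
    5: "forwarding",
    6: "broken",
}

# Nagios plugin exit codes.
OK, WARNING, UNKNOWN = 0, 1, 3


def check_stp_state(observed, expected):
    """Compare a port's STP state with its normal state."""
    observed_name = STP_STATES.get(observed)
    if observed_name is None:
        return UNKNOWN, f"STP UNKNOWN - unexpected value {observed}"
    if observed == expected:
        return OK, f"STP OK - port is {observed_name}"
    expected_name = STP_STATES.get(expected, str(expected))
    return WARNING, f"STP WARNING - port is {observed_name}, normally {expected_name}"


# The redundant fiber port is normally blocking (2); here it has started
# forwarding (5), which means the live path went away.
code, message = check_stp_state(observed=5, expected=2)
print(code, message)
```

With plain check_snmp you get the same effect by putting the normal value in the -r argument, one service per port.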

Every network connection from switch to switch to router to a server and its port is monitored every 5 minutes for interface operational status and spanning tree status (in the case of switch ports).

Along with the port checks, we perform checks on the routers/switches fans, power supplies, temperature, etc. In the case of routers, we check the spanning tree topology changes, since there are times, when someone screws up, and connects networks that are supposed to be isolated by a router and we end up with more than one spanning tree root bridge, which makes the networks screwy and fight over who is going to be the boss(root bridge).

Bottom line is simple. If your nagios status map doesn’t look exactly like the network that you have, then you are not going to be able to find problems quickly.

70% of our problems have always been with some cable in some switch/router. Seldom do we have a unit just die. So if most of the failures are cables, then we need to find those problem cables quickly. check_snmp will solve that problem and point you to the exact switch and port (if you have set up your parent/child relationships correctly). Every other affected device/host will show “unreachable” or “undetermined”, and only ONE host (port) will show as “DOWN”. Looking at the status map, you can now easily see just where the break is.
[Edited Fri Jun 03 2005, 06:26AM]

Is your nagios status map publicly admirable? 8)

Sure, but it’s kinda small. The live one is not on the net. :frowning:

Shucks, I see we can’t upload images any longer. Oh well.