Adding switches

luca · October 14, 2005, 2:05pm

Hi all,
It appears I’ll soon have to monitor the whole network side of our two web farms.
This means the complete network, including switches and so on… Having finally gained access to the switches’ SNMP passwords, I’m starting to play around and try out what I can get out of them…

My main concern is having to add the whole lot of data… It’s some 10-12 switches, most of them with at least 24 ports, and each port gets polled for 4 or 5 services… that’s a LOT, even polling only every 30 minutes or so (counters resetting is no problem… MRTG is doing its job already). Mainly I see a problem writing the whole lot into the cfg files… Does anybody have some sort of script ready to generate the required lines?
And what do you monitor?
I’m thinking of getting out data for OperStatus, speed, duplex mode, description… anything else?

Thanks in advance, Luca

jakkedup · October 14, 2005, 2:57pm

It’s about time, but 10-12 switches? Bah, nuttin. I’m monitoring every switch in our company, which is a lot more than that.
So, let’s get this done shall we?

First, why check speed, duplex, description, etc.? Those values never ever change unless someone deliberately changes them. When setting up a host on a switch, the first thing you should do is make sure it is getting good throughput. If you see that the switch is autonegotiate/full duplex and the host is the same, but they are only getting a 2mbits/sec transfer rate with a 100mbit card/switch port, then turn off autonegotiation on the switch and set it to manual 100/full. Now test again and they will be getting maybe 70-80mbits/sec.
Operstatus is vital. Do not use the check_ifoperstatus plugin, it’s much slower than a simple check_snmp with the correct oid values.
Always use the -m switch with check_snmp. This tells the plugin to use THIS MIB. That way, it’s much faster to execute.
For example:
check_snmp!public!.1.3.6.1.2.1.2.2.1.8.2!1!ifOperStatus!RFC1213-MIB
$USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ -r $ARG3$ -l $ARG4$ -m $ARG5$
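To make the relationship between those two lines explicit: the first is the check_command used in the service, the second is the command_line of the command definition, and each "!"-separated field after the command name fills $ARG1$, $ARG2$, and so on. Below is a minimal Python sketch of that substitution, useful for previewing what will actually be executed. The function name, the plugin path, and the 192.0.2.10 address are assumptions for illustration only, and the sketch ignores escaping and any other macros Nagios supports.

```python
def expand_check_command(check_command, command_line, macros):
    """Return (command name, expanded command line) for a Nagios-style check_command."""
    # First "!"-separated field is the command name; the rest become $ARG1$, $ARG2$, ...
    name, *args = check_command.split("!")
    expanded = command_line
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    # Remaining macros ($USER1$, $HOSTADDRESS$) are supplied by the caller here;
    # in a real setup Nagios fills them from resource.cfg and the host definition.
    for macro, value in macros.items():
        expanded = expanded.replace(macro, value)
    return name, expanded


if __name__ == "__main__":
    name, line = expand_check_command(
        "check_snmp!public!.1.3.6.1.2.1.2.2.1.8.2!1!ifOperStatus!RFC1213-MIB",
        "$USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ -r $ARG3$ -l $ARG4$ -m $ARG5$",
        # Assumed values, not taken from the thread:
        {"$USER1$": "/usr/local/nagios/libexec", "$HOSTADDRESS$": "192.0.2.10"},
    )
    print(name)
    print(line)
```

Running the expanded line by hand from a shell is a quick way to confirm an OID and MIB name before adding a few hundred ports.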
Why check the status of every switch port (now that I’ve hopefully talked you out of speed, duplex, etc.)? You shouldn’t. Not all ports are used, and not all ports in use serve IMPORTANT hosts. Some ports will be used by customers and they will be going up/down all day long every day.
It is imperative that all people know to NEVER EVER make a cable change on switches without you knowing about it. This will become clear later.
Monitor iftraffic like this perhaps:
check_if_inout_octets!public!50000,50000!80000,80000!.1.3.6.1.2.1.2.2.1.10.72,.1.3.6.1.2.1.2.2.1.16.72
/usr/local/nagios/libexec/check_snmp_counter2 -H $HOSTADDRESS$ -C $ARG1$ -w $ARG2$ -c $ARG3$ -o $ARG4$ --desc IN,OUT --isoctets --inkilos
Now the good stuff. First define your nagios pc as a host with a service of check_fping. All of the following will be identical, so you can just copy/paste or use a template. Make it without a parent. Without a parent, in the status map view this host will have a line to the “nagios process”. Now define your nagios pc’s eth0 card as a host with a service check of check_snmp for ifoperstatus. Its parent is the nagios PC host defined above. Next define the port in the switch that the nagios pc connects to as a host with a service check of ifoperstatus. Its parent is the eth0 host on the nagios pc. Now define the switch as a host, with a service check of check_fping and perhaps any others, like load on the switch. Its parent is the port defined just previously on that switch. Now define all important ports on that switch as hosts, with a parent of the switch. Next define the eth0 card of each host that connects to that switch as a host, and finally the host itself.
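The parent chain described here can be sketched as a small Python generator that prints Nagios host objects. This is only an illustration under assumptions: the host names, the 192.0.2.x addresses, and the "generic-host" template are hypothetical, and the parents chosen for the server side (eth0 card behind its switch port, server behind its eth0 card) are one reading of the layout. Port "hosts" use the switch's own address, since that is the address the SNMP check has to talk to.

```python
def host_definition(host_name, address, parent=None, template="generic-host"):
    """Render one Nagios host object; 'template' is a hypothetical template name."""
    lines = [
        "define host {",
        f"    use        {template}",
        f"    host_name  {host_name}",
        f"    address    {address}",
    ]
    if parent is not None:
        # A host without 'parents' hangs directly off the Nagios process in the map.
        lines.append(f"    parents    {parent}")
    lines.append("}")
    return "\n".join(lines)


# (host_name, address, parent), ordered from the Nagios PC outward.
# All names and addresses are made up for the example.
LAYOUT = [
    ("nagios-pc", "192.0.2.5", None),
    ("nagios-pc-eth0", "192.0.2.5", "nagios-pc"),
    ("sw1-p1", "192.0.2.2", "nagios-pc-eth0"),  # switch port the Nagios PC plugs into
    ("sw1", "192.0.2.2", "sw1-p1"),             # the switch itself
    ("sw1-p2", "192.0.2.2", "sw1"),             # an important port on that switch
    ("web1-eth0", "192.0.2.20", "sw1-p2"),      # NIC of the server on that port
    ("web1", "192.0.2.20", "web1-eth0"),        # the server itself
]


def render_layout(layout):
    """Join the host objects for every entry in the layout."""
    return "\n\n".join(host_definition(n, a, p) for n, a, p in layout)


if __name__ == "__main__":
    print(render_layout(LAYOUT))
```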

Summary, what you have now in your status map, is a complete physical layout of your network hardware. Every port with an important host is defined and shows the connectivity. If a cable becomes unplugged, you can instantly know what cable/port on a switch is broken. This has actually been a godsend for us here. Many times, a cable is not broken, but simply needs to be reseated into the switch.

Now continue on with the rest of the network. How does switch A connect to switch B? Define the ports used to connect the 2, following the logic above. Where does your router connect to the network? Make sure you define the ports on the switches used to connect this router, and also the ports on the router. You may have to search the net to find the oid/mib for your router, but the standard ifoperstatus may work also. But there are lots of goodies in the routers you will want to check.
On our routers I can check:
upper and lower power supply status, fan rpm, temp inside the case, and fan exhaust temp, to name a few.
Remember, now that nagios’ status map shows that hostA connects to switchC via port 12, then obviously, NOBODY should ever just change this without you making the matching changes to nagios. Otherwise, your status map will be inaccurate. This is not impossible and well worth the effort. All techs have been read the riot act and KNOW that they can no longer just move stuff around for the heck of it.
I’ll post more as I think of it, or if you need more.

jakkedup · October 14, 2005, 3:43pm

Here is a pic of one switch connected to another switch. On almost EVERY switch stack (many switches connected together with a cascade cable to look and act like ONE BIG switch), we utilize 2 fiber connections to connect to each other. They utilize spanning tree, so if one fiber fails, then spanning tree switches over to the other fiber. We monitor the spanning tree status with check_snmp for its NORMAL status (either blocking or forwarding). If this spanning tree status ever changes, we get a nagios error. This has proven VERY important. Spanning tree was going crazy on our network, changing many times per minute. The customers hardly felt it, but nagios saw it happening. The problem was 2 vlans on one switch from different networks, and spanning tree ignores vlans. So both networks were fighting for the “root path” title. To fix that, you just turn off spanning tree for those ports only, the ones that connect the switch to the other network. I’m telling you this so that you find out how your switches connect to each other: 2 fibers with spanning tree or MLT (multi-link trunk), or perhaps 2, 3, 4 copper ports using spanning tree or MLTs. Either way, you would define them as hosts accordingly.

You may find out like I did that nagios looks at this as a circular path, and won’t allow it. So I had to lie to nagios and tell it an inaccurate parent/child relationship. Always lie on the path that is spanning tree blocking. If you are using MLTs then no lying is needed. Define the parents logically.
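If you generate your host definitions from a script, you can look for such a circular path before nagios complains about it. The following Python sketch does a depth-first walk over a host-to-parents map and returns the first loop it finds. It is not how nagios itself performs the check, just a pre-flight helper; the function name is made up, and the host names follow the naming convention of this thread purely as an example.

```python
def find_parent_cycle(parents):
    """Return one circular parent path as a list of host names, or None.

    'parents' maps each host name to a list of its parent host names.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state = {}
    path = []

    def visit(host):
        state[host] = GREY
        path.append(host)
        for parent in parents.get(host, []):
            seen = state.get(parent, WHITE)
            if seen == GREY:
                # The parent is already on the current path: that is a loop.
                return path[path.index(parent):] + [parent]
            if seen == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[host] = BLACK
        return None

    for host in list(parents):
        if state.get(host, WHITE) == WHITE:
            cycle = visit(host)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Two stacks joined by two links, with parents following the cables all the way round.
    looped = {
        "Sparc-AS400": ["Sparc-AS400-2p25"],
        "Sparc-AS400-2p25": ["Sparc-OHR-2p25"],
        "Sparc-OHR-2p25": ["Sparc-OHR"],
        "Sparc-OHR": ["Sparc-OHR-1p25"],
        "Sparc-OHR-1p25": ["Sparc-AS400-1p25"],
        "Sparc-AS400-1p25": ["Sparc-AS400"],
    }
    print(find_parent_cycle(looped))
```

Dropping the parent entry on the blocking link (the "lie") makes the function return None again.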

Important, adopt a naming convention so people can read this.
At our company, we have many networks all connected together with routers, vlans, etc. So the first part of the switch name is its network. Part 2 of the name is the physical location (hub closet name maybe?). The next is the switch # (if you have several switches in the same closet you can number them). Next is the switch stack unit # and finally the port on that switch.

So in my example pic, the switch is on the “sparc” network, its location is in the AS400 room, and it’s the only stack there, so there is no “stack #”. Now that they have found the switch stack, the next # is the unit in the stack, then finally p# (the port # on that switch).

As you can see, Sparc-AS400 switch stack connects to Sparc-OHR switch stack via 2 fiber connections. One is utilizing unit 1 port 25 to connect to unit 1 port 25 on the other. The other connection is normally blocking and is utilizing unit2 port 25 to unit 2 port 25 on the other.
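A naming convention like this is easy to encode once and reuse in any generator script. Here is a small Python sketch that builds names of the form network-location-[stack]-unit"p"port, which yields names like Sparc-OHR-1p1. The function name is made up, and the exact position and format of the optional stack number is an assumption, since the example stacks in this thread have none.

```python
def port_host_name(network, location, unit, port, stack=None):
    """Build a switch port host name such as 'Sparc-OHR-1p1'.

    'stack' is only included when a closet holds more than one stack;
    where it goes in the name is an assumption for this sketch.
    """
    parts = [network, location]
    if stack is not None:
        parts.append(str(stack))
    parts.append(f"{unit}p{port}")
    return "-".join(parts)


if __name__ == "__main__":
    # The two fiber links between the stacks in the example picture.
    for unit in (1, 2):
        print(port_host_name("Sparc", "AS400", unit, 25), "<->",
              port_host_name("Sparc", "OHR", unit, 25))
```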

If this was a MLT, you could add checks to make sure that each MLT is operational with ifoperstatus simply. In my case, I have to use ifoperstatus to make sure all cables are connected, but also a spanning tree check to make sure one path is forwarding and the other path is blocking (they will never change unless there is a problem, but you can’t ignore the problem, since that indicates a cable failure or maybe spanning tree going wacko on your network).
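The "expected state never changes" logic can be sketched in Python as below. It assumes the switch exposes the standard BRIDGE-MIB dot1dStpPortState object (.1.3.6.1.2.1.17.2.15.1.3.<port>), whose values are 1 disabled, 2 blocking, 3 listening, 4 learning, 5 forwarding, 6 broken; your switches may need a vendor-specific OID instead. The function name is hypothetical, and the return codes follow the usual plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
# dot1dStpPortState values from the standard BRIDGE-MIB (assumed to apply to your switch).
STP_PORT_STATES = {
    1: "disabled",
    2: "blocking",
    3: "listening",
    4: "learning",
    5: "forwarding",
    6: "broken",
}

OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional plugin exit codes


def check_stp_state(observed, expected):
    """Compare an observed dot1dStpPortState value with the port's normal state name."""
    name = STP_PORT_STATES.get(observed)
    if name is None:
        return UNKNOWN, f"STP UNKNOWN - unexpected state value {observed}"
    if name == expected:
        return OK, f"STP OK - port is {name}"
    return CRITICAL, f"STP CRITICAL - port is {name}, expected {expected}"


if __name__ == "__main__":
    # The normally blocking link has started forwarding: the other path is likely down.
    print(check_stp_state(5, "blocking"))
```

With plain check_snmp the same idea is expressed by passing the normal state's number as the expected value, one service per port.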

jakkedup · October 14, 2005, 4:01pm

Here is a pic of one host, its interface card, and the port used to connect that IF card to a switch. The switch is not shown, since it’s too far to the left, but of course, the other line on the switch port travels over to the switch itself. By defining parents logically and defining a nic card and a switch port as hosts in nagios, with correct parent/child relationships, ANYONE can disconnect every cable from your network, and now YOU can cable it back up the EXACT SAME way as it was originally. Without this, you would have to have a network schematic showing every single connection, which you most likely don’t have.

But since you are now going to start monitoring switch ports, please do it right and get this status map accurate. Do not let people make network changes without fixing nagios also.

jakkedup · October 14, 2005, 4:29pm

OK, now let’s get really creative here. In the following, I haven’t added the eth ports on the hosts yet, nor even the switches it connects to. But they could easily be done.

What it does show is a sort of “service dependency”, but in a graphical way, in the status map. Not only is the dependency shown graphically, the parents are defined in such a way that they are logically true.

So we have 2 Oracle hosts, Bus-dnmacdb1 and db2. They are running Oracle RAC, so when customers connect to the database, they really don’t know what host or what path they are using, and the cpu loads are balanced.

In order for Oracle to start up, the first thing we need is either Bus-dnmacdb1 or 2 to be up and operational as a plain Solaris server. Add all of your Unix type checks to this host, i.e. disk space, cpu usage, etc.

After the OS is up and running, we fire up the Raid volumes and mount them (add your raid volume checks to this host). Without the raid volumes, there is no disk space to hold the Oracle tables. Each Solaris server has access to the same volume using Oracle RAC, so you see a connection from each server to the Bus-dnmac-vol (for volumes) “host”.

Next we need an Oracle instance to connect to the database (add instance service checks here). There are 3 ways this could be done. A person could “bypass” the load balancing and connect to a “specific instance” by using the serv1 or serv2 instance. Or they could connect by using the serv instance (which would balance the load and put them on either one). So, if the volumes are mounted, then we can start up the Oracle instances (lines on the status map show this dependency). There is a line from -serv to -serv1 and -serv2 to show that we need either one of those in order for the -serv instance to operate. Without either one, there is no -serv instance.

Now that they can connect to the Oracle instances, they probably want to access some tables. So show this as a host called -tables. If the instances are up, then we can now access tables. The service checks performed on the host “-tables” are several. i.e. tablespace free, cachehits, etc. and all being graphed with nagiostat. These graphs have been a lifesaver and have given us many days warning of impending disaster of a tablespace filling up.

**I’m showing you this to force home the idea that in nagios, your hosts.cfg file does NOT have to contain a real host, but something that could be useful in showing on the status map in a logical way, so people can see what is broken and where more logically.**

In my Oracle example it would be quick to discover that there is nothing wrong with the Sun Box, nor the Solaris OS, nor the raid hardware, nor the Oracle instances, but in fact, it’s simply a tablespace filled up or something like that. But the customer is simply going to tell you, “my Oracle connection is broke and I can’t do anything with it”.

jakkedup · October 14, 2005, 6:42pm

The only things I monitor on the switch host itself are the status of fans #1, 2, and 3 and the power supply status.

Other than that, the rest of checks are on the switch port “host” itself. i.e. ifoperstatus, spanning tree status, iftraffic. iftraffic can be very useful if you graph it, since there will always be that trouble call “the network seems really slow” and you will be able to show “no, it’s not the net that is slow”.

BTW, I just had a trouble call “the network is slow”. One of our DS3 interfaces is a T1 speed link, and it was at full bandwidth utilization. Yes it was slow and the graphs had shown this. So we immediately knew this wasn’t an outside internet server problem, etc. Without a nagios iftraffic graph with nagiostat, it would have taken much much longer to figure out if in fact the network was being eaten up by one user downloading tons of data (which is what the problem was).

jakkedup · October 14, 2005, 9:07pm

On routers I found that I can get a lot more good stuff from them, like CPU usage, internal case temp, fan exhaust temp…

luca · October 17, 2005, 8:21am

thanks a lot
I will keep this bookmarked for a long time

Luca

Clipper · October 19, 2005, 11:52am

Wow jakkedup you’re the man. I understand now why you got all these stars in your profile

After having read your prose, I now realize that I do monitor 100 hosts/250 services, but I do not monitor the switches on which they are connected…

Now a few things are still obscure to me :

Is there any point in creating one host for a server and another one for its ethernet interface if the server has only one ?
A host needs an ip address. What do you put there for the server if you put the IP address on a separate “ethernet interface host” ?
Same question for the “virtual” hosts associated to a switch port, they do not have IP addresses
What is the check_command for all these virtual hosts ?

More globally, could we have some portions of your different config files ? that would be of fantastic help. Let me know if you want to keep this private instead.

Thanks a lot
Clipper :o

jakkedup · October 22, 2005, 2:24pm

Just a FYI, I’m in a production environment, and it takes a long time to find a bad port on a switch or a cable unplugged in a complex network. In my case, it’s imperative to have these hosts online ASAP. In order to do that, nagios is used to find the problems for me.

[quote]

Is there any point in creating one host for a server and another one for its ethernet interface if the server has only one ?[/quote]

The reason you create a host named eth0 is because I/you can disable that interface. If you did not create this host, you could simply monitor that interface as a service under the normal host checks and life would be good. But you might as well go all out, and create the eth0 host and a service check that checks the ifoperstatus. The biggest reason is that it will LOOK NICE on the status map. You do want your boss to be impressed, don’t you?

The IP address is the address needed by your service check. So it would be the IP addy assigned to that interface. Any computer can have several interface cards with multiple IPs (multinetted). So it’s up to you to show this in nagios by creating a separate host/service check for each and every one of these IPs. For example: hostA has one IF card. That card has 2 IPs associated with it (multinet). Show this on the status map: Host A with a line to eth0:0 and also eth0:1. Eth0:0 connects to switch2 port 3 and eth0:1 is also connected to switch2 port 3. Since the REAL PC can be accessed by using either IP addy (multinet), you would not be able to show this if you did it like everyone else does, now could you? By doing it like this, it’s obvious to anyone looking at the status map that “hey, I can get to that pc by using either one of these IPs, not just ONE”.

You are right, a switch port doesn’t have an IP, so use the IP needed to allow the service check to run (ifoperstatus for example), which is the IP of the switch itself. But a router can, and many times will, have an IP on that interface. In the case of a router, your status map will show all of those interfaces and each one will have its own IP addy (we have 2 routers that are like this).

I use check_snmp for every interface. Mainly I check ifoperstatus and sometimes stp (spanning tree status).

Here is a host that is actually a vlan on a switch. A switch can be cut up into several vlans, and so we want to show this on the status map. It would look ugly and be confusing to see on the status map that network A connects to switch1 and network B does too. How can you have 2 different and separate networks connected to the same switch? Wouldn’t that make them now the same network? No it wouldn’t, if you have each connection on a separate vlan on that switch1.
Here is vlan A
host_name Sparc-OHR-NBU
service_description VlanROWStatus-NBU
check_command check_snmp!public!.1.3.6.1.4.1.2272.1.3.2.1.20.5!1!ROWStatus!RAPID-CITY

here is vlan B
host_name Sparc-OHR-MGS
service_description VlanROWStatus-MGS
check_command check_snmp!public!.1.3.6.1.4.1.2272.1.3.2.1.20.1!1!ROWStatus!RAPID-CITY

Both are on the same switch, but the status map will show them as a connection to the REAL switch 1. Then you could show a connection from this vlan to all the ports that are a member of that vlan. How could you show this otherwise? People will now know, “hey, ports 1-4 are vlanA and ports 5-24 are vlanB” Nice!

jakkedup · October 22, 2005, 2:32pm

Here is the ifstatus check that I do. This will tell you if the cable is unplugged, sometimes they just need reseating, sometimes it gets cut by the electricians, sometimes it’s your “BELL telephone company” that disconnects you, and sometimes a port goes bad on a switch.
host_name Sparc-OHR-1p1
service_description IFSTATUS-1p1
check_command check_snmp!public!.1.3.6.1.2.1.2.2.1.8.1!1!ifOperStatus!RFC1213-MIB
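Since the original question was about generating these lines for a few hundred ports, here is a minimal Python sketch that prints one IFSTATUS service per port in the same shape as the definition above. It rests on assumptions: the function names and the "generic-service" template are made up, and it treats the SNMP ifIndex as equal to the port number, which matches the 1p1 example but is not guaranteed on every switch or stack unit (walk the ifDescr table to confirm the mapping first).

```python
IF_OPER_STATUS_OID = ".1.3.6.1.2.1.2.2.1.8"  # ifOperStatus column; value 1 means "up"


def ifstatus_service(host_name, port_label, if_index, community="public",
                     template="generic-service"):
    """Render one ifOperStatus service; 'template' is a hypothetical template name."""
    check = (f"check_snmp!{community}!{IF_OPER_STATUS_OID}.{if_index}"
             "!1!ifOperStatus!RFC1213-MIB")
    return "\n".join([
        "define service {",
        f"    use                  {template}",
        f"    host_name            {host_name}",
        f"    service_description  IFSTATUS-{port_label}",
        f"    check_command        {check}",
        "}",
    ])


def services_for_unit(prefix, unit, ports):
    """One service per port; assumes ifIndex == port number (check this on your switch)."""
    blocks = []
    for port in ports:
        label = f"{unit}p{port}"
        blocks.append(ifstatus_service(f"{prefix}-{label}", label, port))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Only list the ports that carry important hosts, not all 24.
    print(services_for_unit("Sparc-OHR", 1, [1, 2, 3]))
```

Redirect the output into a cfg file that nagios includes, and regenerate it whenever someone (with your blessing) moves a cable.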

But remember, the big thing here is to show the schematic layout of your network on the status map. Every cable can be removed, and you are now able to put it all back together.