SNMP and Nagios

tekhed · September 8, 2005, 2:35pm

Hello -
Currently I have Nagios monitoring our linux servers and sending out e-mail notifications as well as paging. I’m trying to get some information from our Cisco routers to work within Nagios and am at a loss.

I have MRTG set up on another host that uses SNMP to gather the information from our Cisco devices. But as you know, MRTG just graphs the data for historical purposes and not for alerts. I’ve read the Nagios docs on SNMP traps. Kinda confused about what that’s telling me to do.

My understanding is that Nagios can monitor any script that’s created to query a device/service. So this would mean that I would need to have a script that can walk the oid(s) I want to look at, evaluate the data they return, print out an appropriate message, then exit accordingly.
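A minimal sketch of that plugin contract, for illustration only: print one status line and return 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The function name evaluate_value and the "alert above the threshold" rule are assumptions made for this example; this is not how check_snmp itself is implemented.

```shell
#!/bin/sh
# evaluate_value is a hypothetical helper, not part of Nagios or check_snmp.
# Arguments: the value read from the device, a warning limit, a critical limit.
evaluate_value() {
    value=$1; warn=$2; crit=$3
    # Anything that is not a plain non-negative integer is reported as UNKNOWN.
    case $value in
        ''|*[!0-9]*) echo "SNMP UNKNOWN - non-numeric value '$value'"; return 3 ;;
    esac
    if [ "$value" -gt "$crit" ]; then
        echo "SNMP CRITICAL - $value"; return 2
    elif [ "$value" -gt "$warn" ]; then
        echo "SNMP WARNING - $value"; return 1
    fi
    echo "SNMP OK - $value"; return 0
}
```

In a real plugin the value would come from a query such as snmpget -v 1 -c COMMUNITY -Oqv HOST OID (COMMUNITY, HOST and OID are placeholders), and the script would end with exit $? so that Nagios sees the status.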

Sounds simple enough but it’s not working. From the Nagios host, I issue the ./check_snmp command followed by the host I want to query, the community string for that host, the OID, followed by the warning and critical ranges. I get an error message that there is “NO DATA RECEIVED FROM HOST”. Here is the command that I used to query a cisco router:

./check_snmp -H (IP) -o (OID) -w (#:#) -c (#:#)
NO DATA RECEIVED FROM HOST

Not sure why I am getting this message. I do know that on my MRTG host I can issue the snmpwalk and snmpget commands to gather data. I do NOT have those binaries on my Nagios host. I DO have the NET-SNMP package installed which ‘should have’ those tools.

Anyway, if anyone could help, it would be greatly appreciated, or at least point me to the documentation that could help. Thanks in advance.

jakkedup · September 8, 2005, 3:35pm

Can you ping the device? Is the device running snmpd? Is the community string “public”? Try using the -C switch and specify the community string. Also, it would be wise to use the -m switch and specify the MIB to use; otherwise, the check will have to search through ALL of your installed MIBs and will take a VERY long time to execute.

tekhed · September 8, 2005, 3:53pm

I can ping the device. The community string is NOT public, but I’m sure that I’m using the correct password in my command. I have used the -C switch followed by community string of the machine. I haven’t used the -m switch yet. Here is the command that I just tried and the following error message:

./check_snmp -H (ip) -C (passwd) -o (oid) -w #:# -c #:#
SNMP PROBLEM: No data received from host
CMD: /usr/bin/snmpget -t 1 -r 5 -m ALL -v 1 -c (cisco_passwd) cisco_ip:161 (oid) . . .

This seems to tell me that I do NOT have the ‘snmpget’ binary on my nagios host. I checked, I DO NOT. I thought those tools would be included in the NET-SNMP package. Guess not.

I will have to speak with the admin to find out if the Cisco is running SNMPd. I’m sure it is, we are gathering info from that router using MRTG that is hosted on another machine.

jakkedup · September 8, 2005, 4:16pm

Use find or slocate to see if you have snmpget or try
which snmpget.
If you are already pulling snmp data from this host for mrtg, then yes, the host is running snmpd. How else could you be getting the data if it was not? But anyway, paste your oid or double check it, to make sure it is valid.
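As a small sketch of that check, the function below reports which commands cannot be found on the current PATH. The name missing_tools is made up for this example; command -v is the standard shell way to ask the same question that which answers.

```shell
#!/bin/sh
# missing_tools is a hypothetical helper: it prints every named command
# that cannot be found on the PATH of the user running it.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running missing_tools snmpget snmpwalk as the same user that runs the plugin prints nothing when both tools are available, and prints the name of each tool that is not.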

tekhed · September 8, 2005, 4:32pm

OK, I spoke with the admin and YES we are running SNMPd on the cisco device. I simply copied the OID from the MRTG.cfg file for CPU utilization. Here it is: 1.3.6.1.4.1.9.2.1.58.0

Of course, in MRTG it’s like this: 1.3.6.1.4.1.9.2.1.58.0&1.3.6.1.4.1.9.2.1.58.0

I’ve tried them both, still the same error. I also tried to find the snmpget and snmpwalk commands on the Nagios host. Nothing.

I’m assuming that this means those tools are not installed and that’s what the ./check_snmp plugin is attempting to use in its request and therefore returning error. If this assumption is correct, my next question is why does my NET-SNMP package:
net-snmp-5.1.2-11, not include those tools?

tekhed · September 8, 2005, 6:56pm

Problem solved. I went to the RedHat Network and downloaded the net-snmp-utils package for my distribution. Then I verified that I had the snmpget and snmpwalk tools that my ./check_snmp plugin needs to work.

I issued the command again from the Nagios Host and received an SNMP OK exit code. That problem is solved.

Now, when I go back to the web interface of Nagios, I am getting a critical state for check_snmp service against that machine and an error code of 127 out of bounds.

Anyway, one problem is solved, moving onto the next. Progress has been made so I’m happy for now. If you have any input as to why I would be getting the error on the web interface side, I would appreciate your thoughts. Thanks.

jakkedup · September 8, 2005, 7:19pm

su - nagios
and then run the command by hand with the -v switch. What do you get?
The nagios user might not have snmpget in its path. Also, you will have to recompile the check_snmp command itself, since you did not have snmpget installed. I’m surprised that we didn’t figure that out during the compile.

tekhed · September 8, 2005, 8:34pm

ok, i’ve switched to the nagios user, issued the command with the verbose switch and it returns SNMP OK. The same thing as the root user.

That tells me that the command is in the path of the nagios user. When you say “recompile the check_snmp” command, do you mean re-install the nagios plugins? Because that came with the nagios-plugins package? Please explain what you mean by “recompiling the check_snmp command”? Thanks.

tekhed · September 8, 2005, 8:46pm

Also, we know that the ./check_snmp script is in fact working because it’s giving an exit code of 0 , or OK. This happens for both the root user on Nagios as well as the nagios user.

It appears that the problem is NOT with the ./check_snmp command but rather the output from that script that is returned to Nagios. When ran from the Nagios command line to poll the Cisco device, I get an SNMP OK AND an integer. That integer from the command line tells me “all is well” but Nagios web interface is probably expecting a different ‘type’ of return code from the ./check_snmp command. Therefore spitting out an error, or Critical state, when, in fact, ALL really is well with the device. Maybe you could shed some light on that theory of Nagios expecting a certain type of return code and getting something different since this is a Cisco device. Thanks.

jakkedup · September 9, 2005, 1:28pm

Most plugins, and this one also, are compiled from source code. If you compiled the plugins on a machine that had snmpget located in /usr/bin, for example, and then moved the file to a box where snmpget was in /usr/local/bin, the plugin would not work. Seeing as how you stated that you had no snmpget command on your system, I figured that now that you do have it, you will have to compile the plugins again. That is done by running ./configure, make, and make install, per the plugin readme file.

It wouldn’t hurt to do this, since you have recently installed NET::SNMP. Try it and see how it turns out.

tekhed · September 9, 2005, 2:09pm

I’ve checked into how the plugins were installed. They were installed via RPM and not compiled from source. Would you suggest that I uninstall the package and then re-install?

I’m not sure that would make any difference. Maybe I didn’t explain this well enough, but when I issue the ./check_snmp script, it looks for /usr/bin/snmpget to execute. This (before I had snmpget on the box) would return the error I mentioned earlier in the post. Then, after getting the snmpget binaries installed, the command WORKS from the command line.

My thought is, that if the plugins needed “recompiled” or re-installed via RPM, this would not change the fact that the web interface of Nagios is not displaying the proper output.

When I run the command from the Nagios Host command line like this:

./check_snmp -H (ip) -C (passwd) -o (oid) -w #:# -c #:#

I get the following return output:

SNMP OK: 41 (checking cisco load)

Since this is an “OK” or “0” exit code, meaning that it is working, wouldn’t that mean the problem is with the web interface’s interpretation of the output of the executed command?

jakkedup · September 9, 2005, 3:19pm

Since the command works from the command line, a recompile should not be needed. The problem is not how nagios is interpreting the output; it’s that it is having trouble running the command itself. This is identical to the problem people have with the check_ping command, “return code out of bounds”.

nagios.org/faqs/viewfaq.php? … desc=false

So I suspect that the problem is incorrect permissions on the plugin.

ls -la /usr/local/nagios/libexec
What do you get?
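To make the "out of bounds" message concrete, here is a rough sketch of how an exit status maps to a state. The helper names state_for_exit_code and report are invented for this example. The 0 to 3 mapping is the plugin convention; 126 and 127 are the statuses a shell returns when a command is not executable or cannot be found at all.

```shell
#!/bin/sh
# state_for_exit_code is a hypothetical helper that names the state
# associated with a plugin's exit status.
state_for_exit_code() {
    case $1 in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "out of bounds ($1)" ;;
    esac
}

# report (also hypothetical) runs a command line through a shell,
# discards its output, and prints the state for its exit status.
report() {
    rc=0
    sh -c "$1" >/dev/null 2>&1 || rc=$?
    state_for_exit_code "$rc"
}
```

For example, report '/no/such/dir/check_snmp -H x' prints "out of bounds (127)" because the shell cannot find the command, which is why a wrong plugin path or wrong permissions (status 126) shows up as a critical state rather than as plugin output.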


tekhed · September 9, 2005, 9:21pm

Ok, here is where I am now. I’ve made some progress as I found a typo in the command definition for ./check_snmp. I had $USER$ as my variable as opposed to the correct $USER1$ variable needed.

Once that was changed, I was no longer getting a CRITICAL error. However, I am getting another error (progress ;p) which is now telling me: check_snmp: Invalid warning threshold: %s

Ok, so this tells me that it doesn’t like the percentage output that it received and therefore cannot display anything other than error.

Here is how the command is defined in the config:
$USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ -w $ARG3$ -c $ARG4$

Now, here is how it’s defined to check that device:
check_snmp!host_ip!passwd!1.3.6.1.4.1.9.2.1.58.0!50!70

OK. That sums up how it’s configured. And, when I run this from the command line on Nagios, I get a return of OK. Here is the command and the output from the command line:

./check_snmp -H host_ip -C passwd -o 1.3.6.1.4.1.9.2.1.58.0 -w 50 -c 70

SNMP OK: 43

SO, this proves that it works from the command line. And you can see that it matches the configuration. You would think that the output on the web interface side would be the output from the command line (SNMP OK - 43). It’s not. I get the error mentioned above. (check_snmp: Invalid warning threshold: %s)

So, I do this: ./check_snmp --help

I see that the -w switch is looking for a range. You can specify a bare integer such as 40,50 etc…this will be interpreted as the max value before a warning is sent to output. OR, you could issue a range separated by colon like this 10:50, meaning that if the value is outside of that range, warning will output.
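The two threshold forms described here can be sketched as follows. This follows only the paraphrase of the help text above (a bare integer is a maximum, MIN:MAX alerts outside the range); the function name outside_threshold is hypothetical, and the real plugin may accept more forms than this sketch does.

```shell
#!/bin/sh
# outside_threshold is a hypothetical helper.
# Returns 0 when the value should raise an alert, 1 when it is fine,
# and 3 when the value or the threshold is not a plain integer.
outside_threshold() {
    value=$1; spec=$2
    case $spec in
        *:*) min=${spec%%:*}; max=${spec#*:} ;;   # "MIN:MAX" form
        *)   min=""; max=$spec ;;                 # bare integer form
    esac
    # Reject anything that is not a non-negative integer (for example 5.0).
    for n in "$value" "$max" ${min:+"$min"}; do
        case $n in
            ''|*[!0-9]*) echo "invalid threshold or value: $n" >&2; return 3 ;;
        esac
    done
    if [ -n "$min" ] && [ "$value" -lt "$min" ]; then return 0; fi
    if [ "$value" -gt "$max" ]; then return 0; fi
    return 1
}
```

With these rules, a value of 43 is fine against both 50 and 10:50, while 60 alerts against 50 and 5 alerts against 10:50.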

I’ve tried both options from command line (bare integer as well as range) and get SNMP OK - # both ways. So, we know that works.

So, I changed the configuration to use these two types of -warning checks, still get the same thing from the web interface. (check_snmp: Invalid warning threshold: %s)

My question is, what is Nagios expecting in return from the ./check_snmp command? AND, why would it NOT report the SNMP OK - # output that it does from the command line?

jakkedup · September 10, 2005, 12:51pm

Nagios is not expecting anything, it’s simply displaying the output of the command that it runs. I suspect that the command defined in services.cfg is not exactly like the one you run by hand. Check your checkcommands.cfg for accuracy and also your services.cfg.
When I run the command using -w 5.0 I get the same error, so the value you have may not be an INTEGER.

tekhed · September 12, 2005, 1:52pm

Well I thought that is what I showed you in my posts. I’ve showed you the checkcommands.cfg for that which is:

define command{
command_name check_snmp
command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o $ARG2$ -w $ARG3$ -c $ARG4$
}

And here is the services.cfg for that:

check_command check_snmp!host_ip!passwd!1.3.6.1.4.1.9.2.1.58.0!50!70

So basically, all that I do when I run the ./check_snmp script from the command line is replace the exclamation marks in the services.cfg you see with the command switches you see in the checkcommands.cfg. From the command line, it works fine.

When I run the verbose option from the command line with ./check_snmp, it tells me this:
SNMPv2-SMI::enterprises.9.2.1.58.0 = INTEGER: 39

SNMP OK - 39

Is there a specific way I should be defining this in the services.cfg? I have it specified as:

check_command check_snmp!host_ip!passwd!1.3.6.1.4.1.9.2.1.58.0!50!70

Should the last two options be different? That’s where the problem seems to be. Thanks.
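One way to compare the two is to expand the macros by hand. The sketch below imitates how Nagios fills $ARGn$: the text before the first "!" in check_command is the command name, and each following field becomes $ARG1$, $ARG2$, and so on. expand_command is a hypothetical helper written for this example, and it does not escape characters that are special to sed.

```shell
#!/bin/sh
# expand_command is a hypothetical helper, not part of Nagios.
# $1: a command_line template, $2: a check_command value.
expand_command() {
    template=$1; check_command=$2
    old_ifs=$IFS; IFS='!'
    set -- $check_command      # split on "!": $1 is the command name
    IFS=$old_ifs
    shift                      # keep only the arguments
    i=1
    for arg in "$@"; do
        # Replace the literal text $ARGi$ with the i-th argument.
        template=$(printf '%s\n' "$template" | sed "s|\\\$ARG$i\\\$|$arg|g")
        i=$((i + 1))
    done
    printf '%s\n' "$template"
}

expand_command '-C $ARG1$ -o $ARG2$ -w $ARG3$ -c $ARG4$' \
    'check_snmp!host_ip!passwd!1.3.6.1.4.1.9.2.1.58.0!50!70'
# prints: -C host_ip -o passwd -w 1.3.6.1.4.1.9.2.1.58.0 -c 50
```

If the service definition really does carry a host_ip field after the command name, as posted, then -w receives the OID instead of 50, because the command definition already takes the host from $HOSTADDRESS$ and uses only four arguments. That is only an observation about the definitions as written in this thread, not a confirmed diagnosis; with the field removed (check_snmp!passwd!1.3.6.1.4.1.9.2.1.58.0!50!70) the expansion matches the command that works by hand.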

jakkedup · September 12, 2005, 2:29pm

Your nagios configs look good to me. Are you pasting those from your configs to this forum, or are you typing them by hand? There must be something wrong in the configs, but I can’t see it. Is $USER1$ defined correctly in resource.cfg?

tekhed · September 12, 2005, 2:58pm

Yeah, I am posting those right from the configs. The only thing I am changing is the ip/passwd to the box. I’ve also just gone back in there to double check the configuration to make sure. Here is how $USER1$ is defined in resource.cfg:

# Sets $USER1$ to be the path to the plugins

$USER1$=/usr/local/nagios/libexec

And keep in mind, I’m monitoring other things and they are working fine. So we know this path is correct and that other scripts are working.

I’m confused on this one.

jakkedup · September 12, 2005, 3:23pm

Then ls -la /usr/local/nagios/libexec and make sure that check_snmp has the same permissions as the rest.

tekhed · September 12, 2005, 3:30pm

Yes the permissions are all the same for the plugins in that directory including the check_snmp plugin.

jakkedup · September 12, 2005, 3:59pm

OK, last resort: configure a check_snmp2 command and make it similar to mine. This runs the command as root by using sudo. If you don’t know how to set up sudo, then we can address that also.

define command{
command_name check_smb_shares
command_line sudo /usr/local/nagios/libexec/check_smb_shares.pl -H $HOSTADDRESS$ -S $ARG1$ -U $USER3$ -P $USER4$ -A $ARG2$ -w $ARG3$ -c $ARG4$