NSCA add on - how best to use?


#1

I have the NSCA addon installed on the Nagios server, compiled and working on the client, and I can send passive checks through and they are received.

What I can’t seem to do is chain the send_nsca command with the plugins loaded on the box. Has anyone done this, and if so, how?


#2

This is how I use nsca.
On the remote hosts:
Complete nagios installs with plugins, minus the cgi when compiled.
In the services.cfg is
obsess_over_service 1

In the checkcommands.cfg is
define command{
command_name submit_check_result
command_line $USER1$/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ '$OUTPUT$|$PERFDATA$'
}
In the nagios.cfg is
obsess_over_services=1
ocsp_command=submit_check_result

In the libexec/eventhandlers directory is a script called submit_check_result.
#!/bin/sh

# Arguments:
#  $1 = host_name (Short name of host that the service is
#       associated with)
#  $2 = svc_description (Description of the service)
#  $3 = state_string (A string representing the status of
#       the given service - "OK", "WARNING", "CRITICAL"
#       or "UNKNOWN")
#  $4 = plugin_output (A text string that should be used
#       as the plugin output for the service checks)
#

# Convert the state string to the corresponding return code
return_code=-1

case "$3" in
    OK)
        return_code=0
        ;;
    WARNING)
        return_code=1
        ;;
    CRITICAL)
        return_code=2
        ;;
    UNKNOWN)
        return_code=-1
        ;;
esac

# pipe the service check info into the send_nsca program, which
# in turn transmits the data to the nsca daemon on the central
# monitoring server

/bin/echo "$1\t$2\t$return_code\t$4\n" | /export/home/nagios/bin/send_nsca 172.16.14.26 -c /export/home/nagios/etc/send_nsca.cfg

On the Central Nagios server:
Nagios is configured with the exact same hostnames and service descriptions as the remote hosts, but on the Central server the check_command is actually service_is_stale.
Freshness checking is enabled and set to 7 minutes.

So what happens is this. The remote hosts go about their regular routine of checking all the defined services every 5 minutes. After each check, they run the ocsp command and submit the result as a passive check to the Central server. If for any reason a remote host stops sending updates, the freshness check will alert you that the service is stale (meaning: find out why the remote host is not sending data).
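For anyone wondering what service_is_stale might look like on the Central server, here is a minimal sketch. It assumes the command only ever runs when the freshness threshold has expired, so it does no real checking and simply reports CRITICAL. The function body and the message text are my own illustration, not the actual script from the setup described above.

```shell
#!/bin/sh
# Hypothetical stand-in for the service_is_stale check command.
# Nagios would only run it when no passive result has arrived within
# the freshness threshold, so it always reports CRITICAL (exit code 2).
service_is_stale() {
    echo "CRITICAL: no passive check result received, remote host may have stopped sending"
    return 2
}
```

Saved as a standalone script, it would end with a call to service_is_stale followed by exit $? so that Nagios sees the return code.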

Making checks like this is very beneficial, since you can now set up literally thousands of checks on your Central nagios server and it won’t even break a sweat. I’m using this method and have over 1200 checks, and the number is still growing. These types of checks are passive checks on the Central nagios server and active checks on the remote hosts. If you were to use the nrpe addon, they would be active checks on both machines, and you would not be able to handle as many of them without beefing up your Central nagios server’s CPU/memory/IO horsepower.

Any other questions, let me know. I love this topic. Wish I could make it a sticky.


#3

That seems like a nice tidy solution, except that in this case our production team have specified that they don’t want any ‘extra’ daemons running, but they don’t mind a cron job being called every minute (go figure). So I guess what I need is a script that will execute any plugin, reformat the results suitably for send_nsca and call it…

Unfortunately my scripting is not that up to scratch. Has anyone already done this? If so I would really appreciate a copy, otherwise I’ll be resorting to RTFM for some shell scripting…


#4

Good point and it’s the EXACT same battle I had to fight.
First, the short answer. You either want to be proactive or not. “What style of management are you/they?”

The manager of IT here didn’t want Nagios, since he had decided that he would “let the customer tell us when something is broken”. But my boss wanted it for the networks and servers that we control, so I installed it. We fix things that aren’t even down yet (proactive). We have found problems with the network 5 minutes after they broke, when it would normally take 45 minutes just to find the broken cable, etc. We have found dead power supplies in redundant power supply systems that would NEVER have been discovered until BOTH power supplies were dead, and then it would have been a crisis. Just off the top of my head, I could name about 50 items, caught by Nagios, that would each have cost an hour (or more) of production downtime.

Here are more points on how I have overcome that obstacle.

First, download a MIB browser like mbrowse (kill-9.org/mbrowse/). Install it, and get the IP address of any switch (not a hub) on your network. Put the IP address in mbrowse and query the status of an interface on that switch using the correct OID and the default read-only password of “public”. Now show them that you can actually change the status of the port, and down it, by using the default read/write password of “secure”. If they are so concerned about security, then surely they must be concerned about this, yet they failed to change the “secure” password. Why?

Another: if it’s simply “no extra daemons” running on remote hosts, then open up nmap. Use nmap to scan the open ports on some Solaris server. Ask them why they need “finger” running on this server, or the multitude of other daemons that open up ports on it. If daemons are a concern, why do they allow “finger”?

Another:
If you use the nrpe addon, there will be an nrpe daemon running on the remote hosts, and the Central nagios server (and possibly malicious people) will be making connections to the remote hosts. In my opinion, that is not a good thing. By using nsca, the nsca daemon is running on the Central nagios server. Nobody is making any connections to the remote hosts, so it is as secure as it could possibly be. Connections are made from the remote hosts to the Central nagios server. When you install Nagios on remote hosts, you compile it without the CGIs, and you configure nagios.cfg to not perform external command checks, so there is no way anyone can force nagios to execute a command by using the cmd file in the RW directory.

But if the only concern is running a daemon and they just don’t have the extra CPU power, ask them if they can actually measure the amount of CPU used by Nagios on that machine, and then ask them if that amount is simply not acceptable. Of course, you will have to configure some “test” box, just like you would do if it were an actual remote host. Grab some junk PC, install Linux on it, and install Nagios as a remote host, as I’ve described (and as the Nagios docs describe). Ask them if they can actually tell that Nagios is running, and just how much of a burden it is placing on this box.

A typical Solaris or NT install has many daemons and processes running that are just not needed. Why is there an snmpd running? Why are there so many open ports? Why do they allow telnet, when it should be set up with ssh? Why ftpd if nobody is using it (some servers need it, some don’t)?

The argument of “no extra daemons” is not based in practice, only in prejudice, since they are not enforcing this argument on the non-essential daemons they have allowed thus far, like snmpd or finger. Not allowing the Nagios daemon to run is like saying: we don’t want to be proactive in finding/fixing trouble, but we do want to be proactive in making sure that nobody installs anything that isn’t made by Microsoft, or something that doesn’t cost over $5000.

I’ve shown our Nagios application to the head of our department (VP/IT), and he is amazed that I’ve done such a great job with no money spent. He’s also very upset that the manager of IT spent over $5000 on an HP product (similar to Nagios), still hadn’t installed it over a year later, and most likely never will.

We have a boss from the “corporate” parent company tech department, coming to see ME, and look at the Nagios installation here and how we have become a proactive tech shop.

The amount of checks being performed on remote hosts is negligible (disc space, CPU, processes, etc.), and you can’t even measure the amount of CPU time that this daemon is consuming. After all, how much CPU power is spent performing “df -k” or “ps -ef|grep ftpd”? Show them the commands that are performed by running them on the command line. Show them the source code, and that the command you just ran is nothing more than “df -k” and some extra data calculation.

On the Central server, that performs over 1,200 active/passive checks, of course, this CPU is working very hard and is dedicated to nagios/apache alone.

Bottom line is, I’ve had Oracle DBAs, Sun/Solaris experts, Veritas experts, and others look at this, and they have all heard of Nagios and have no concern about allowing Nagios to run on these Oracle, Sun, Solaris, etc. servers. Nagios is not some crap code that barely stays running. I’ve never even seen it crash, on the Central or on the remote hosts, and we are querying many Oracle/Sun boxes, hundreds of switches, and on and on.

I’m positive your production team is not much different from ours. So when you have shown them just how lax they are in their policy, since they allow “finger” and other unneeded daemons, I’m sure it still won’t make much of a difference. They either believe in proactive management or in crisis management, and you won’t be able to sway them either way. This is a management style issue that you are battling now, and it has nothing to do with “extra daemons”, since they already allow many that do nothing for them.

If all else fails, just set up cron to run the plugin commands every 5 minutes and pipe the output to the submit_check_result command. The nsca client will then pass the data to the Central Nagios server. This is perfectly OK, and is sufficient to gather the status that you want. But what is not OK is that someone has adopted a “no extra daemons” policy, but only enforces it on applications that are beneficial, and not on the daemons that aren’t needed.

Good luck, and tread lightly. They just might get pissed off at you for showing them that they aren’t all that they pretend to be. Pfft, allowing telnet but not Nagios? How ludicrous is that? It’s a great boss who can accept that they may be wrong in their judgement, and actually allow you to be proactive and install Nagios on critical servers.

Neither you nor I are script writing experts. Why should we have to spend time making scripts for cron, when Nagios is a proven product? It’s like reinventing the wheel.

This is a noble battle you have, my friend, one which I have won, and one which has gained me much appreciation from the VP/IT and my peers. We support a great number of other smaller “tech groups”, and all of them allow Nagios on their servers and relish the fact that they never have to check the log files, disc space, Oracle tablespace free, etc. The department that I work in has become so proactive that Nagios is now known and accepted by almost every employee in this company. It is only those managers who have stayed steadfast in their “crisis management” policy that don’t want anything to do with Nagios.

Edited Fri May 13 2005, 07:39AM ]


#5

Veritas is installed on many of our Oracle servers. It provides high availability in many areas. I’ve studied these installations, and I can see nothing that it does, that Nagios can’t do. For example:
Every 5 minutes, Veritas checks to see if a NIC has failed. If it has, it attempts to restore it; if that fails again, it brings the backup NIC online, thus providing high availability.
Nagios can be configured to not only monitor, but to take action also. This from the docs “An obvious use for event handlers (especially with services) is the ability for Nagios to proactively fix problems before anyone is notified.”

In fact, I don’t know if Veritas has the checks available that Nagios does. Can it check free space in Oracle tablespaces like Nagios can? I don’t know, but I would much rather spend $0.00 on something than $Godknowshowmanythousands for Veritas, if all I want to do is monitor Oracle tablespace free, disc space free, ftpd running, etc.

If they would allow Veritas to be installed, why not Nagios? Does it crash less than Nagios? I’ve never seen Nagios crash. Does Nagios not have support? Can’t be that, since there are many people who will provide support for a fee, and some, like me and others on this forum, for free!!! Veritas is a better quality product? Prove it. I can’t see the difference myself.

I’m not saying Veritas is a lousy product, just the opposite. I compare Veritas to Nagios. They are both wonderful, amazing products that provide a huge benefit and are well worth any effort or money spent. It’s up to the IT bosses to figure out if they want to spend thousands or nothing to get the results they want. Can Nagios give you proactive results? Try it out, it costs nothing to find out. Can Veritas give you proactive results? Try it out, it costs you $thousands to find out. Or, you could simply do nothing, and “let the customers tell you” when something is broken.

Edited Fri May 13 2005, 09:02AM ]


#6

This is my script:

==== cut ====

#!/bin/sh

# check_disk.sh

libdir="/usr/local/nagios/libexec"
send_nsca="/usr/local/bin/send_nsca"
send_nsca_cfg="/usr/local/etc/send_nsca.cfg"

nagioshostname=172.16.0.86
nagiosport=5667
hostname=172.16.0.1
servicename=diskspace
checkcommand="check_disk -w 20% -c 5% -p /"

# Run the plugin, capturing its output and its exit status
result=$($libdir/$checkcommand)
returncode=$?
echo -e "$hostname\t$servicename\t$returncode\t$result" | $send_nsca -H $nagioshostname -p $nagiosport -c $send_nsca_cfg

==== cut ====

It is run by cron every 5 minutes. On the station which is monitored there are only nagios plugins and send_nsca installed.
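Since the original question was about a script that can run any plugin, the same idea can be sketched in a more generic form. This is only an illustration under a few assumptions: run_plugin_for_nsca is a made-up helper name, it uses printf rather than echo -e to produce the tabs, and it keeps only the first line of plugin output.

```shell
#!/bin/sh
# Sketch: run any plugin command and print one tab-separated line of the
# form host<TAB>service<TAB>return_code<TAB>output, ready for send_nsca.
# run_plugin_for_nsca is a hypothetical helper name.
run_plugin_for_nsca() {
    host=$1
    service=$2
    shift 2
    # The remaining arguments are the plugin and its options.
    code=0
    output=$("$@" 2>&1) || code=$?
    # Keep only the first line of the plugin output.
    output=$(printf '%s\n' "$output" | head -n 1)
    printf '%s\t%s\t%s\t%s\n' "$host" "$service" "$code" "$output"
}
```

With the paths from the script above, the cron job would call something like run_plugin_for_nsca 172.16.0.1 diskspace /usr/local/nagios/libexec/check_disk -w 20% -c 5% -p / and pipe the result into /usr/local/bin/send_nsca -H 172.16.0.86 -p 5667 -c /usr/local/etc/send_nsca.cfg.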


#7

!!!
Mind that in the “echo” command there should be
\t (backslash t) which means tabulation
between $hostname and $servicename and all other fields, not just “t”. It’s the forum engine that has erased the backslash while posting.


#8

meulie.net/forum_viewtopic.php?21.2367

Please make the above change if you plan on using very many passive checks.


#9

I have a couple of issues when trying to configure distributed monitoring with nsca. Please can someone give me a few clues? My two problems are as follows:

1 - reports sent back to the central server via nsca do not have the check output, they just appear as “;$OUTPUT$|$PERFDATA$”. The status is correct. Where do I define this variable on the sending server?

2 - My hosts are not all available from the central nagios server, so I check ping availability from the boxes with send_nsca running on them. This reports fine on this box, but the central server needs to run a command to check availability from the hosts.cfg file. How do I stop this from happening?

Many thanks,
Ian


#10

I’m guessing that it’s your checkcommands.cfg definition.
This is my checkcommands.cfg definition on remote host/distributed server.

define command{
command_name submit_check_result
command_line $USER1$/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ '$OUTPUT$|$PERFDATA$'
}
I’m guessing that yours looks like …
$USER1$/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ ;'$OUTPUT$|$PERFDATA$'

For problem #2, you don’t have the central server set up correctly. It is not supposed to be performing active checks on the hosts that are being checked by the distributed server.
nagios.sourceforge.net/docs/1_0/distributed.html

The central server has the web interface installed (optional, but recommended)

The central server has its enable_notifications directive set to 1. This will enable notifications. (optional, but recommended)

The central server has active service checks disabled (optional, but recommended - see notes below)

The central server has external command checks enabled (required)

The central server has passive service checks enabled (required)

* The central server must have service definitions for all services that are being monitored by all the distributed servers. Nagios will ignore passive check results if they do not correspond to a service that has been defined.
* If you're only using the central server to process services whose results are going to be provided by distributed hosts, you can simply disable all active service checks on a program-wide basis by setting the execute_service_checks directive to 0. If you're using the central server to actively monitor a few services on its own (without the aid of distributed servers), the enable_active_checks option of the definitions for services being monitored by distributed servers should be set to 0. This will prevent Nagios from actively checking those services.
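As a quick sanity check of the two settings marked as required in the list above, something like the following could be run against the central server's nagios.cfg. It is only a sketch: central_cfg_ok is a made-up helper name, and it assumes the settings appear as check_external_commands=1 and accept_passive_service_checks=1 in the main config file.

```shell
#!/bin/sh
# Sketch: verify that a central server's nagios.cfg enables external
# command checks and passive service checks.
# central_cfg_ok is a hypothetical helper; pass the path to nagios.cfg.
central_cfg_ok() {
    cfg=$1
    for directive in check_external_commands accept_passive_service_checks; do
        # The directive must be present and set to 1.
        if ! grep -Eq "^[[:space:]]*${directive}[[:space:]]*=[[:space:]]*1[[:space:]]*$" "$cfg"; then
            echo "missing or disabled: ${directive}"
            return 1
        fi
    done
    echo "ok"
}
```

Calling central_cfg_ok with the path to your own nagios.cfg prints ok when both settings are on, or names the first one that is missing or disabled.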

Upgrade to 2.0 if you really need to, since it will do passive host checks.
nagios.sourceforge.net/docs/2_0/ … hecks.html

If there is absolutely no way to get some kind of active check on the central server for a host check to this “behind a firewall” box, then 2.0 is about all you can do.

This from the 2.0 hosts.cfg file
active_checks_enabled [0/1]
passive_checks_enabled [0/1]

This from the 1.2 hosts.cfg file
checks_enabled [0/1]


#11

Thanks! That’s sorted problem 2. I’m still having problems with the first issue though. My submit_check_result command is exactly like yours, so it can’t be a problem with that, and that submit_check_results script is the same as what you pasted in this thread, obviously with the ip and file locations altered. An example of the string being sent back to the central server is “server-1.tower-13 ‘check_smtp’ OK ‘$OUTPUT$|$PERFDATA$’”, so the output/perfdata variables just aren’t being recognised. Do you have any other ideas? Thanks, Ian


#12

p.s. - I am using version 2.0


#13

Ah, sorted it out now. Sorry for all the posts! In version 2 you need to use '$SERVICEOUTPUT$|$SERVICEPERFDATA$' or '$HOSTOUTPUT$|$HOSTPERFDATA$'. Thanks for the help anyway!!


#14

That’s some good info, thanks for sharing.


#15

I have an additional question concerning NSCA

I have used nsca the traditional way for months, i.e. the nsca daemon running on the central Nagios server and send_nsca on the distributed server to send service check results to the central server.
Everything works OK.

Now I want to catch SNMP traps on the central server and therefore I have to submit my trap to the local nsca-daemon.

However, on the local machine, when I execute: echo "testbox\tTRAP\t1\tTEST\n" | /opt/nagios/bin/send_nsca 'server-ip' -c /opt/nagios/etc/send_nsca.cfg
I get : 0 data packet(s) sent to host succesfully

If I execute the same command from my distributed server, I get 1 data packet(s) sent to host succesfully

Any idea, why this doesn’t work locally ? I have put 127.0.0.1 and ‘server-ip’ in the allowed hosts-section of my nsca.cfg


#16

Here is a correction, in case I stated it wrong somewhere else. What was happening with nsca is that I was getting flooded with connections, and xinetd would shut them down due to excessive connections. To fix this, change:
vi /etc/xinetd.d/nsca
make it like this:
service nsca
{
    flags           = REUSE
    socket_type     = stream
    wait            = no
    user            = nagios
    group           = nagios
    server          = /usr/local/nagios/bin/nsca
    server_args     = -c /usr/local/nagios/etc/nsca.cfg --inetd
    cps             = 9000 30
    # This is what is important here!!!
    instances       = UNLIMITED
    log_on_failure  += USERID
    disable         = no
    only_from       = IP1 ip2
}

Edited Sat Oct 22 2005, 06:25AM ]


#17

Hi!

I’d like to receive SNMP traps from a switch in Nagios through the NSCA addon. I’ve read the online documentation about this topic, and followed the instructions on how to make Nagios handle SNMP traps. I’ve configured snmptrapd and written my own handler script. But it doesn’t work properly.

I can see this in the nagios.log, when a trap is sent out:

[1132059606] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;10.116.180.1;something;2;error

But I don’t get any alerts about the problem

If I submit a passive check result manually for this service, i see these lines in the log:

[1132060206] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;10.116.180.1;something;2;problem|
[1132060206] SERVICE ALERT: 10.116.180.1;something;CRITICAL;HARD;1;problem

I get alerts when I do it manually, and I can see the state change in the graphical interface.

When a trap causes the CHECK_RESULT, there is no | (pipe) sign at the end of the line, and there is no SERVICE ALERT line in the log either. So I don’t get any alerts.

What is causing this problem? Could you help me?
Thanks!

(I’m sorry for my English. I hope you understand what I mean.)

Edited Tue Nov 15 2005, 01:09PM ]


#18

PROCESS_SERVICE_CHECK_RESULT;10.116.180.1;something;2;error should end with the | pipe, so you are right. Modify your script so that is what is submitted.
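To make the shape of that line concrete, here is a small sketch of a formatter that always appends the pipe. It is an illustration only: format_trap_result is a made-up helper name, and it assumes a date command that supports +%s for the timestamp. If the trap handler submits through send_nsca instead of writing the external command itself, the equivalent change would be to end the output field with the pipe.

```shell
#!/bin/sh
# Sketch: build a PROCESS_SERVICE_CHECK_RESULT line that ends with the
# pipe character. format_trap_result is a hypothetical helper name.
format_trap_result() {
    # $1 = host, $2 = service description, $3 = return code, $4 = output
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s|\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4"
}
```

For the example in the post above, format_trap_result 10.116.180.1 something 2 error produces the same line as the manual submission, apart from the timestamp and the output text.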


#19

So this time it’s me…
After using Nagios for quite a while I got a bit of time to test NSCA.

The local Nagios server works well. I set up a new Nagios install on a remote server, installed libmcrypt on both, and configured and compiled nsca, all well.

Following the instructions in this thread I got most of it stitched together…

If I log on to the remote server as user nagios I can send nsca data to the local server, and I find it in the event log.

The problem is that the remote server doesn’t send the results. In syslog I find this:

nagios: Warning: OCSP command ‘/usr/local/nagios/libexec/eventhandlers/submit_check_result logger ‘Current Load’ OK ‘OK - load average: 0.00, 0.11, 0.15|load1=0.000;5.000;10.000;0; load5=0.110;4.000;6.000;0; load15=0.150;3.000;4.000;0;’’ for service ‘Current Load’ on host ‘logger’ timed out after 5 seconds

any idea what could be wrong?

Thank you. Luca

EDIT: got it… simple chair/keyboard interface error… copying scripts is good, not checking them is bad… I was using jakkedup’s server private IP address…
Edited Sun Jan 22 2006, 11:05PM ]


#20

Made a couple more tests…
The problem (as far as I can understand it) is that send_nsca is not exiting correctly.

Data I type into the running script gets sent to the central server, but after 10 seconds the script times out and the central server logs an error “End of transmission…”…

The data looks correctly formatted, with tabs and a newline at the end (I tried outputting it with echo into a regular file and I had a newline at the end). The problem is that the newline does NOT terminate the send_nsca script…

I get a timeout on the remote server (running from the command line) and a syslog error “End of transmission…” on the central server.

Any ideas?

Thanks, Luca

ADDED INFO: If I run send_nsca from the command line I get something in the nagios_log; when Nagios runs it I don’t get anything in the nagios_log, even though from the debug info I can see the connection has been established.

MORE INFO: the seg fault comes from using triple DES; testing without a password is way better… I will try to find an encryption method without errors. On the other hand, I needed to add the -e switch to /bin/echo so that tabs and newlines are recognized.
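One way to sidestep the echo -e difference between systems is to build the line with printf, which handles \t and \n in its format string. The following is a sketch, not a tested replacement for submit_check_result: submit_line is a made-up name, and the state-to-code mapping (including -1 for anything that is not OK, WARNING or CRITICAL) simply mirrors the script posted earlier in this thread.

```shell
#!/bin/sh
# Sketch: convert the state string to a return code and print the
# tab-separated line for send_nsca using printf instead of echo.
# submit_line is a hypothetical helper name.
submit_line() {
    # $1 = host, $2 = service description, $3 = state string, $4 = output
    case "$3" in
        OK)       code=0 ;;
        WARNING)  code=1 ;;
        CRITICAL) code=2 ;;
        *)        code=-1 ;;
    esac
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$code" "$4"
}
```

The output of submit_line would then be piped into send_nsca exactly as the echo output was.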

Edited Tue Jan 24 2006, 04:41PM ]